Predictive models are used to guess (statisticians would say: predict) values of a variable of interest based on other variables. As an example, consider prediction of sales based on historical data, prediction of the risk of heart disease based on a patient's characteristics, or prediction of political attitudes based on Facebook comments.
Predictive models have been constructed throughout human history. Ancient Egyptians, for instance, used observations of the rising of Sirius to predict the flooding of the Nile. A more rigorous approach to model construction may be attributed to the method of least squares, published more than two centuries ago by Legendre in 1805 and by Gauss in 1809. With time, the number of applications in economics, medicine, biology, and agriculture grew. The term regression was coined by Francis Galton in 1886. Initially, it referred to biological applications, while today it is used for various models that allow prediction of continuous variables. Prediction of nominal variables is called classification, and its beginning may be attributed to the works of Ronald Fisher in 1936.
During the last century, many statistical models that can be used for predictive purposes have been developed. These include linear models, generalized linear models, regression and classification trees, rule-based models, and many others. Developments in the mathematical foundations of predictive models were boosted by the increasing computational power of personal computers and the availability of large datasets in the era of "big data" that we have entered.
With the increasing demand for predictive models, model features such as flexibility, the ability to internally perform some feature engineering, and high precision of predictions are of interest. To obtain robust models, ensembles of models are used. Techniques like bagging, boosting, or model stacking combine hundreds or thousands of small models into a single super-model. Large deep neural models have over a billion parameters.
There is a cost of this progress. Complex models may seem to operate like "black boxes". It may be difficult, or even impossible, to understand how thousands of coefficients affect the model prediction. At the same time, complex models may not work as well as we would like them to. An overview of real problems with large black-box models may be found in an excellent book by Cathy O'Neil (O'Neil 2016) or in her TED Talk "The era of blind faith in big data must end". There is a growing number of examples of predictive models whose performance deteriorated over time or became biased in some sense. See, for instance, the issues related to the flu epidemic predictions by the Google Flu Trends model [Lazer et al., Science 2014] or the problems with cancer recommendations based on the IBM Watson model [https://www.statnews.com/2017/09/05/watson-ibm-cancer/].
Today, the true bottleneck in predictive modelling is not the lack of data, nor the lack of computational power, nor the lack of flexible models. It is the lack of tools for model validation, model exploration, and explanation of model decisions. Thus, in this book, we present a collection of methods that may be used for this purpose. As the development of such methods is a very active area of research and new methods become available almost continuously, we do not aim at being exhaustive. Rather, we present the mind-set, key problems, and several examples of methods that can be used in model exploration.
Seventy-six years ago Isaac Asimov formulated the Three Laws of Robotics: 1) a robot may not injure a human being, 2) a robot must obey the orders given to it by human beings, and 3) a robot must protect its own existence.
Today's robots, like cleaning robots, robotic pets, or autonomous cars, are far from being conscious enough to fall under Asimov's ethics. However, we are more and more surrounded by complex predictive models and algorithms used for decision making. Machine learning models are used in health care, politics, education, justice, and many other areas. These models and algorithms have a far larger influence on our lives than physical robots. Yet, applications of such models are left unregulated despite examples of their potential harmfulness. See Weapons of Math Destruction by Cathy O'Neil (O'Neil 2016) for an excellent overview of selected problems.
It is clear that we need to control the models and algorithms that may affect us. Thus, Asimov's laws are referred to in the context of the discussion around the Ethics of Artificial Intelligence. Initiatives to formulate principles for AI development have been undertaken, for instance, in the UK [Olhede & Wolfe, Significance 2018, 15: 6-7]. Following Asimov's approach, we could propose three requirements that any predictive model should fulfill:
We see two ways to comply with these requirements. One is to use only models that fulfill these conditions by design. However, a reduction in performance may be the price for transparency. Another is to use tools that allow us, perhaps by using approximations, to "explain" predictions for any model. In our book, we will focus on the latter.
It is worth noting that, when it comes to predictive models, the same concepts have often been given different names in statistics and in machine learning. For instance, in the statistical-modelling literature, one refers to "explanatory variables," with "independent variables," "predictors," or "covariates" as often-used equivalents. Explanatory variables are used in the model as means to explain (predict) the "dependent variable," also called "predicted" variable or "response." In the machine-learning language, "input variables" or "features" are used to predict the "output" variable. In statistical modelling, models are fit to the data that contain "observations," whereas in the machine-learning world a dataset may contain "instances."
To the extent possible, in our book we try to consistently use the statistical-modelling terminology. However, the reader may expect references to a "feature" here and there. Somewhat inconsistently, we also introduce the term "instance-level" explanation. Instance-level explanation methods are designed to extract information about the behavior of the model related to a specific observation or instance. On the other hand, "global" explanation techniques allow obtaining information about the behavior of the model for an entire dataset.
We consider models for dependent variables that can be continuous or nominal. The values of a continuous variable can be represented by numbers with an ordering that makes some sense (zip codes or phone numbers are not considered continuous variables). A continuous variable does not have to be continuous in the mathematical sense; counts (number of floors, steps, etc.) will be treated as continuous variables as well. A nominal variable can assume only a finite set of values that cannot be meaningfully ordered or given numeric values.
In this book we focus on "black-box" models. We discuss them in a bit more detail in the next section.
Black-box models are models with a complex structure that is hard for humans to understand. Usually this refers to a large number of model coefficients. As humans may vary in their capacity for understanding complex models, there is no strict threshold for the number of coefficients that makes a model a black-box. In practice, for most humans this threshold is probably closer to 10 than to 100.
A "white-box" model, the opposite of a "black-box" one, is a model that is easy to understand by a human (though maybe not by every human). It has a simple structure and a limited number of coefficients. The two most common classes of white-box models are decision or regression trees (see an example in Figure @ref(fig:BILLCD8)) and models with an additive structure, like the following model for mortality risk in melanoma patients:
\[ RelativeRisk = 1 + 3.6 \cdot [Breslow > 2] - 2 \cdot [TILs > 0] \]
In the model, two explanatory variables are used: an indicator whether the thickness of the lesion according to the Breslow scale is larger than 2 mm and an indicator whether the percentage of tumor-infiltrating lymphocytes (TILs) was larger than 0.
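For illustration, the model above can be written down as a one-line R function; this is a minimal sketch, with argument names of our choosing:

```r
# A sketch of the additive white-box model above; argument names are ours.
relative_risk <- function(breslow_mm, tils_percent) {
  1 + 3.6 * (breslow_mm > 2) - 2 * (tils_percent > 0)
}

# A lesion thicker than 2 mm, with no tumor-infiltrating lymphocytes:
relative_risk(breslow_mm = 3, tils_percent = 0)
## [1] 4.6
```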
The structure of a white-box model is, in general, easy to understand. It may be difficult to collect the necessary data, build the model, fit it to the data, and/or perform model validation, but once the model has been developed its interpretation and mode of working are straightforward.
Why is it important to understand the model structure? There are several important advantages. If the model structure is clear, we can easily see which variables are included in the model and which are not. Hence, we may be able to, for instance, question the model when a particular explanatory variable was excluded from it. Also, in the case of a model with a clear structure and a limited number of coefficients, we can easily link changes in model predictions with changes in particular explanatory variables. This, in turn, may allow us to challenge the model against the domain knowledge if, for instance, the effect of a particular variable on predictions is inconsistent with previously established results. Note that linking changes in model predictions with changes in particular explanatory variables may be difficult when there are many variables and/or coefficients in the model. For instance, a classification tree with hundreds of nodes is difficult to understand, as is a linear regression model with hundreds of coefficients.
Getting an idea about the performance of a black-box model may be more challenging. The structure of a complex model like, e.g., a neural-network model, may be far from transparent. Consequently, we may not understand which features influence the model decisions and how. As a result, it may be difficult to decide whether the model is consistent with the domain knowledge. In our book we present tools that can help in extracting the information necessary for the evaluation of complex models.
The lifecycle of a model can be divided, in general, into three phases: development (or building), deployment, and maintenance.
Model development is the phase in which one is looking for the best available model. During this process, model exploration tools are useful. Exploration involves evaluation of the fit of the model, verification of the assumptions underlying the model (diagnostics), and assessment of the predictive performance of the model (validation). In our book we will focus on the visualization tools that can be useful in model exploration. We will not, however, discuss visualization methods for diagnostic purposes, as they are extensively discussed in many books devoted to statistical modelling.
Model deployment is the phase in which a predictive model is adopted for use. In this phase it is crucial that the users gain confidence in using the model. It is worth noting that the users might not have been involved in the model development. Moreover, they may only have got access to the software implementing the model, which may not provide any insight into the details of the model structure. In this situation, model explanation tools can help to understand the factors that influence model predictions and to gain confidence in the model. These tools are one of the main focus points of our book.
Finally, a deployed model requires maintenance. In this phase, one monitors the model's performance by, for instance, checking the validity of predictions for different datasets. If issues are detected, model explanation tools may be used to find the source of the problem and to suggest a modification of the structure of the model.
Some classes of models have been developed over a long period of time or have attracted intensive research. Consequently, those classes of models are equipped with very good tools for model exploration or visualisation. For example:
Of course, the list of model classes with dedicated collections of model-explanation and/or diagnostics methods is much longer. This variety of model-specific approaches does lead to issues, though. For instance, one cannot easily compare explanations for two models with different structures. Also, every time a new architecture or a new ensemble of models is proposed, one needs to look for new methods of model exploration. Finally, for brand-new models no tools for model explanation or diagnostics may be immediately available.
For these reasons, in our book we focus on model-agnostic techniques. In particular, we prefer not to assume anything about the model structure, as we may be dealing with a black-box model with an unclear structure. In that case, the only operation that we may be able to perform is evaluation of a model for a selected observation.
However, while we do not assume anything about the structure of the model, we will assume that the model operates on \(p\)-dimensional vectors and, for a single vector, it returns a single value which is a real number. This assumption holds for a broad range of models for data such as tabular data, images, text data, videos, etc. It may not be suitable for, e.g., models with memory in which the model output does not depend only on the model input [TOMASZ: NOT SURE WHICH MODELS ARE MEANT HERE].
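To make this assumption concrete, the following sketch shows the only interface we rely on; the wrapper name predict_function is ours, not part of any package:

```r
# Model-agnostic view: all we assume is a function that maps a data frame
# with p columns of explanatory variables to a numeric vector of predictions.
# The name predict_function is illustrative, not a package API.
predict_function <- function(model, newdata) {
  as.numeric(predict(model, newdata))
}
```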
Note that the techniques considered in the book may not be sufficient to fully understand models in case \(p\) is large.
TODO: Here we should tell why we present examples for DALEX. And mention that there are also other functions that can be used.
Our book is split in two parts. In the part Instance-level explainers, we present techniques for exploration and explanation of model predictions for a single observation. On the other hand, in the part Global explainers, we present techniques for exploration and explanation of a model's performance for an entire dataset. In each part, every method is described in a separate section that has the same structure:

* Subsection Introduction explains the goal of and the general idea behind the method.
* Subsection The Algorithm shows mathematical or computational details related to the method. This subsection can be skipped if you are not interested in the details.
* Subsection Example shows an exemplary application of the method with a discussion of results.
* Subsection Pros and Cons summarizes the advantages and disadvantages of the method. It also provides some guidance regarding when to use the method.
* Subsection Code snippets shows the implementation of the method in R and Python. This subsection can be skipped if you are not interested in the implementation.
TO DO: A SHORT REVIEW OF THE CONTENTS OF VARIOUS CHAPTERS
Finally, we would like to signal that, in this book, we do show
On the other hand, in this book, we do not focus on
Przemek's work on interpretability started during research trips within the RENOIR project (691152 - H2020/2016-2019). He would like to thank prof. Janusz Holyst for the chance to take part in this project.
Przemek would also like to thank prof. Chris Drake for her hospitality. This book would have never been created without the perfect conditions that Przemek found at Chris' house in Woodland.
This book has been prepared by using the bookdown package (Xie 2018), created thanks to the amazing work of Yihui Xie.
We illustrate techniques introduced in this book on three datasets:
Titanic sinking by Willy Stöwer
The sinking of the RMS Titanic is one of the deadliest peacetime maritime disasters in history. Over 1500 people died as a consequence of a collision with an iceberg. Thanks to projects like Encyclopedia Titanica https://www.encyclopedia-titanica.org/ we have very rich and precise data about the passengers. The data is available in the titanic package.
library("titanic")
head(titanic_train, 2)
The feature of interest is the binary variable Survived. Let's build some predictive models for this variable.
First we need to do some data preprocessing. Character columns are converted to factors and rows with missing data are removed.
titanic_small <- titanic_train[, c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")]  # keep selected columns
titanic_small$Survived <- factor(titanic_small$Survived)   # target as a factor
titanic_small$Sex <- factor(titanic_small$Sex)             # characters to factors
titanic_small$Embarked <- factor(titanic_small$Embarked)
titanic_small <- na.omit(titanic_small)                    # drop rows with missing values
head(titanic_small)
It is always a good idea to do data exploration before modeling. But since this book is focused on model exploration, we will spend only a few lines on the data exploration part and limit ourselves to two-variable summaries for each variable.
The feature of interest, survival, is binary; thus, a natural choice is logistic regression. Most of the predictive features are categorical, except for age.
There is no reason to expect a linear relation between age and the odds of survival; thus, for age we will use linear tail-restricted cubic splines, available in the rcs() function in the rms package (Harrell Jr 2018).
library("rms")
lmr_model <- lrm(Survived == "1" ~ Pclass + Sex + rcs(Age) + SibSp +
Parch + Fare + Embarked, titanic_small)
lmr_model
## Logistic Regression Model
##
## lrm(formula = Survived == "1" ~ Pclass + Sex + rcs(Age) + SibSp +
## Parch + Fare + Embarked, data = titanic_small)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 714 LR chi2 353.20 R2 0.527 C 0.874
## FALSE 424 d.f. 12 g 2.192 Dxy 0.748
## TRUE 290 Pr(> chi2) <0.0001 gr 8.957 gamma 0.749
## max |deriv| 0.1 gp 0.363 tau-a 0.361
## Brier 0.134
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 12.1530 26.3889 0.46 0.6451
## Pclass -1.0822 0.1693 -6.39 <0.0001
## Sex=male -2.7848 0.2317 -12.02 <0.0001
## Age -0.2382 0.0472 -5.04 <0.0001
## Age' 0.7961 0.2138 3.72 0.0002
## Age'' -4.3833 1.5123 -2.90 0.0038
## Age''' 5.1859 2.2693 2.29 0.0223
## SibSp -0.5292 0.1449 -3.65 0.0003
## Parch -0.1990 0.1344 -1.48 0.1387
## Fare 0.0032 0.0029 1.11 0.2670
## Embarked=C -4.6054 26.3806 -0.17 0.8614
## Embarked=Q -5.3341 26.3853 -0.20 0.8398
## Embarked=S -4.9968 26.3802 -0.19 0.8498
##
In addition to logistic regression, we will use a random forest model with default settings. Random forest is known for good performance, the ability to grasp low-level variable interactions, and stability.
Here we are using the randomForest package (Liaw and Wiener 2002).
library("randomForest")
rf_model <- randomForest(Survived ~ Pclass + Sex + Age + SibSp +
Parch + Fare + Embarked,
data = titanic_small)
rf_model
##
## Call:
## randomForest(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = titanic_small)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 18.21%
## Confusion matrix:
## 0 1 class.error
## 0 382 42 0.0990566
## 1 88 202 0.3034483
The last model that we will train on this dataset is a gradient boosting model. This family of models is known for its ability to grasp deep interactions between variables.
Here we are using the implementation from the gbm package (Ridgeway 2017).
library("gbm")
gbm_model <- gbm(Survived == "1" ~ Pclass + Sex + Age + SibSp +
Parch + Fare + Embarked, data = titanic_small, n.trees = 15000)
## Distribution not specified, assuming bernoulli ...
gbm_model
## gbm(formula = Survived == "1" ~ Pclass + Sex + Age + SibSp +
## Parch + Fare + Embarked, data = titanic_small, n.trees = 15000)
## A gradient boosted model with bernoulli loss function.
## 15000 iterations were performed.
## There were 7 predictors of which 7 had non-zero influence.
Having all three models, let's see what the odds of surviving are for an 8-year-old boy named Henry, travelling alone in the 1st class, who paid a fare of 72 and embarked in Cherbourg.
henry <- data.frame(
Pclass = 1,
Sex = factor("male", levels = c("female", "male")),
Age = 8,
SibSp = 0,
Parch = 0,
Fare = 72,
Embarked = factor("C", levels = c("","C","Q","S"))
)
The logistic regression model says 88.4% for survival.
predict(lmr_model, henry, type = "fitted")
## 1
## 0.8836101
The random forest model says 53.6% for survival.
predict(rf_model, henry, type = "prob")
## 0 1
## 1 0.464 0.536
## attr(,"class")
## [1] "matrix" "votes"
The gradient boosting model says 83.2% for survival.
predict(gbm_model, henry, type = "response", n.trees = 15000)
## [1] 0.8315509
Three different answers. Which one should we trust? Tools introduced in the following sections will help us understand how these models differ.
(fig:localDALEXsummary) Summary of three approaches to local model exploration and explanation.
Instance-level explainers help to understand how a model yields a prediction for a single observation. We can think about the following situations as examples:
A model is a function with a \(p\)-dimensional vector \(x\) as an argument. The plot of the values of the function can be constructed in a \(p+1\)-dimensional space. An example with \(p=2\) is presented in Figure @ref(fig:cutsSurfaceReady). We will use it as an illustration of key ideas. The plot provides information about the values of the function in the vicinity of point \(x^*\).
(fig:cutsSurfaceReady) Response surface for a model that is a function of two variables. We are interested in understanding the response of the model at a single point \(x^*\).
There are many different tools that may be used to explore the predictions of the model around a single point \(x^*\). In the following sections we will describe the most popular approaches. They can be divided into three classes.
(fig:cutsTechnikiReady) Illustration of different approaches to instance-level explanation. Panel A presents a What-If analysis with Ceteris Paribus profiles. The profiles plot the model response as a function of a value of a single variable, while keeping the values of all other explanatory variables fixed. Panel B illustrates the concept of local models. A simpler white-box model is fitted around the point of interest. It describes the local behaviour of the black-box model. Panel C illustrates the concept of variable attributions. Additive effects of each variable show how the model response differs from the average.
In this section we introduce tools based on the Ceteris Paribus principle. The main goal of these tools is to help understand how changes in model input affect changes in model output.
The explainers presented here are linked with the second law introduced in Section @ref(three-single-laws), i.e., the law for prediction's speculations. This is why these explainers are also known as What-If model analysis or Individual Conditional Expectations (Goldstein et al. 2015). It turns out that it is easier to understand how a black-box model is working if we can play with it by changing one variable at a time.
Ceteris paribus is a Latin phrase meaning "other things held constant" or "all else unchanged". Using this principle, we examine the input variables one by one, assuming that the effects of all other variables are unchanged. See Figure @ref(fig:modelResponseCurveLine).
(fig:modelResponseCurveLine) A) Model response surface. Ceteris Paribus profiles, marked with black curves, help to understand the curvature of the model response when only a single variable is updated. B) CP profiles are individual conditional model responses.
Similar to the LIME method introduced in Section @ref(LIME), Ceteris Paribus profiles examine the curvature of a model response function. The difference between these two methods is that LIME approximates the model curvature with a simpler white-box model that is easier to present. Usually the LIME model is sparse, thus our attention may be limited to a smaller number of dimensions. By contrast, the CP plots show the conditional model response for every variable. In the last subsection we discuss pros and cons of this approach.
Let \(f_{M}(x): \mathcal R^{d} \rightarrow \mathcal R\) denote a predictive model, i.e., a function that takes a \(d\)-dimensional vector and calculates a numerical score. The symbol \(x \in \mathcal R^d\) refers to a point in the feature space. We use the subscript \(x_i\) to refer to different data points and the superscript \(x^j\) to refer to specific dimensions. Additionally, let \(x^{-j}\) denote all coordinates except the \(j\)-th, and let \(x|^j=z\) denote a data point with all coordinates equal to those of \(x\), except coordinate \(j\), which is set to the value \(z\). That is, \(\forall_{i \neq {j}}\; (x|^j=z)^i = x^i\) and \((x|^j=z)^j = z\). In other words, \(x|^j=z\) denotes \(x\) with the \(j\)-th coordinate changed to \(z\).
Now we can define uni-dimensional Ceteris Paribus Profile for model \(f\), variable \(j\) and point \(x\) as
\[ CP^{f, j, x}(z) := f(x|^j = z). \] That is, a CP profile is the model response obtained for an observation created from \(x\) by changing coordinate \(j\) to \(z\) while keeping all other coordinates unchanged.
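The definition translates directly into code. Below is a minimal, model-agnostic sketch of a CP profile; this is not the ceterisParibus implementation, and all names are ours:

```r
# Sketch: a Ceteris Paribus profile for one variable of a one-row data frame.
# `grid` is the set of values z at which the profile is evaluated;
# predict_fun must return numeric predictions.
cp_profile <- function(model, observation, variable, grid,
                       predict_fun = function(m, d) predict(m, d)) {
  predictions <- sapply(grid, function(z) {
    modified <- observation
    modified[[variable]] <- z      # x|^j = z: change only coordinate j
    predict_fun(model, modified)
  })
  data.frame(z = grid, prediction = predictions)
}
```

For instance, a call like cp_profile(rf_model, new_apartment, "floor", 1:10) would trace the model response along the floor variable.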
A natural way to visualise CP profiles is to use a profile plot as in Figure @ref(fig:HRCPFiredHours).
Figure @ref(fig:HRCPFiredHours) shows an example of a Ceteris Paribus profile. The black dot stands for the prediction for a single observation. The grey line shows how the model response would change if, in this single observation, the coordinate hours were changed to a selected value. One thing that we can read from it is that the model response is not smooth and there is some variability along the profile. Another is that, for this particular observation, the model response would drop significantly if the variable hours were higher than 45.
(fig:HRCPFiredHours) Ceteris Paribus profile for a Random Forest model that assesses the probability of being fired in a call center as a function of the average number of working hours
Since in the example dataset we are dealing with a model for three classes, one can plot CP profiles for each class in the same panel. See an example in Figure @ref(fig:HRCPAllHours).
(fig:HRCPAllHours) Ceteris Paribus profiles for three classes predicted by the Random Forest model as a function of the average number of working hours
Usually the model input consists of many variables, and then it is beneficial to show more variables at the same time. The easiest way to do so is to plot consecutive variables on separate panels. See an example in Figure @ref(fig:HRCPFiredAll).
(fig:HRCPFiredAll) Ceteris Paribus profiles for all continuous variables
Visual examination of variables is insightful, but for a large number of variables we end up with a large number of panels, most of which are flat. This is why we want to assess variable importance and show only profiles for important variables. The advantage of CP profiles is that they lead to a very natural and intuitive way of assessing variable importance for a single prediction. The intuition is: the more important the variable, the larger the changes along the CP profile. If a variable is not important, then the model response will barely change. If a variable is important, the CP profile changes a lot for different values of the variable.
Let’s write it down in a more formal way.
Let \(vip^{CP}_j(x)\) denote the variable importance calculated based on CP profiles at point \(x\) for variable \(j\):
\[ vip^{CP}_j(x) = \int_{-\infty}^{\infty} |CP^{f,j,x}(z) - f(x)| dz \]
Thus, it is the integrated absolute deviation of the CP profile from \(f(x)\). Note that one can consider different modifications of this coefficient:
TODO: we need to verify which approach is better. Anna Kozak is working on this
A straightforward estimator of \(vip^{CP}_j(x)\) is
\[ \widehat{ vip^{CP}_j(x)} = \frac 1n \sum_{i=1}^n |CP^{f,j,x}(x^j_i) - f(x)|, \] where \(x^j_i\) is the \(j\)-th coordinate of the \(i\)-th observation in the dataset.
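Given the cp_profile() sketch from the previous section, the estimator can be computed in a few lines; again, a sketch under our own naming, not a package API:

```r
# Sketch of the oscillation estimator: mean absolute deviation of the
# CP profile (evaluated at the observed values of variable j) from f(x).
cp_oscillation <- function(model, observation, variable, data,
                           predict_fun = function(m, d) predict(m, d)) {
  profile <- cp_profile(model, observation, variable,
                        grid = data[[variable]], predict_fun = predict_fun)
  mean(abs(profile$prediction - predict_fun(model, observation)))
}
```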
Figure @ref(fig:CPVIPprofiles) shows the idea behind measuring oscillations. The larger the highlighted area, the more important the variable.
(fig:CPVIPprofiles) CP oscillations are average deviations between CP profiles and the model response
Figure @ref(fig:CPVIP1) summarizes variable oscillations. Such visuals help to quickly grasp how large the model oscillations are around a specific point.
(fig:CPVIP1) Variable importance plots calculated for Ceteris Paribus profiles for observation ID: 1001
NOTE
Variable importance for a single prediction may be very different from the variable importance for the full model.
For example, consider the model \[ f(x_1, x_2) = x_1 \cdot x_2 \] where variables \(x_1\) and \(x_2\) take values in \([0,1]\).
From the global perspective both variables are equally important.
But the local variable importance is very different. Around the point \(x = (0, 1)\), the importance of \(x_1\) is much larger than that of \(x_2\). This is because the profile \(f(z, 1)\) has larger oscillations than \(f(0, z)\).
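This can be verified numerically with a few lines of R; a small sketch of the oscillation calculation for this toy model:

```r
# Oscillations of f(x1, x2) = x1 * x2 around the point (0, 1).
f <- function(x1, x2) x1 * x2
z <- seq(0, 1, length.out = 101)
mean(abs(f(z, 1) - f(0, 1)))  # profile along x1: 0.5
mean(abs(f(0, z) - f(0, 1)))  # profile along x2: 0
```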
The definition of Ceteris Paribus profiles given in Section @ref(ceterisParibus1d) may be easily extended to two or more variables. Also, the definition of CP oscillations in Section @ref(oscillations) has a straightforward generalization to a larger number of dimensions. Such generalisations are useful when the model is non-additive. The presence of pairwise interactions may be detected with 2D Ceteris Paribus plots.
Let's define the two-dimensional Ceteris Paribus profile for model \(f\), variables \(j\) and \(k\), and point \(x\) as
\[ CP^{f, (j,k), x}(z_1, z_2) := f(x|^{(j,k)} = (z_1,z_2)). \] That is, a 2D CP profile is the model response obtained for an observation created from \(x\) with coordinates \(j\) and \(k\) changed to \((z_1, z_2)\) and all other coordinates kept unchanged.
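The one-dimensional sketch from Section @ref(ceterisParibus1d) extends naturally; again, a hedged sketch with names of our choosing:

```r
# Sketch: a 2D Ceteris Paribus profile over a grid of pairs (z1, z2).
cp_profile_2d <- function(model, observation, var1, var2, grid1, grid2,
                          predict_fun = function(m, d) predict(m, d)) {
  grid <- expand.grid(z1 = grid1, z2 = grid2)
  grid$prediction <- mapply(function(z1, z2) {
    modified <- observation
    modified[[var1]] <- z1    # x|^(j,k) = (z1, z2)
    modified[[var2]] <- z2
    predict_fun(model, modified)
  }, grid$z1, grid$z2)
  grid
}
```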
A natural way to visualise 2D CP profiles is to use a level plot as in Figure @ref(fig:CP2Dsurflor).
(fig:CP2Dsurflor) Ceteris Paribus plot for a pair of variables. The black cross marks the coordinates of the observation of interest. The presented model estimates the price of an apartment
If the number of variables is small or moderate, then it is possible to present all pairs of variables. See an example in Figure @ref(fig:CP2Dall).
(fig:CP2Dall) Ceteris Paribus plot for all pairs of variables.
Ceteris Paribus profiles are also a useful tool to validate local model fidelity. It may happen that the global performance of the model is good, while for some points the local fit is very bad. Local fidelity helps to understand how good the model fit is around the point of interest.
How does it work?
The idea behind fidelity plots is to select some number of points from the validation dataset that are closest to the point of interest, similarly as in the k-nearest-neighbours method. Then, for these neighbours, we may plot Ceteris Paribus profiles and check how stable they are.
Also, if we know the true target values for points from the validation dataset, we may plot the residuals to show how large they are.
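A minimal sketch of neighbour selection, using Euclidean distance over numeric columns and assuming the observation of interest has the same columns as the data; the function name is ours:

```r
# Sketch: pick the k validation rows closest to the point of interest.
select_neighbours <- function(data, observation, k = 15) {
  num <- sapply(data, is.numeric)                 # use numeric columns only
  centered <- scale(data[, num],
                    center = unlist(observation[num]), scale = FALSE)
  distances <- sqrt(rowSums(centered^2))          # Euclidean distance
  data[order(distances)[seq_len(k)], ]
}
```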
An example fidelity plot is presented in Figure @ref(fig:CPfidelity1). The black line shows the CP profile for the point of interest, while grey lines show CP profiles for its neighbours. Red intervals stand for residuals; in this example, the residuals for the neighbours are all negative. Thus, the model may be biased around the point of interest.
(fig:CPfidelity1) Local fidelity plots. Black line shows the CP profile for the point of interest. Grey lines show CP profiles for nearest neighbors. Red intervals correspond to residuals. Each red interval starts in a model prediction for a selected neighbor and ends in its true value of target variable.
This observation may be confirmed by plots that compare the distribution of all residuals against the distribution of residuals for the neighbours.
See Figure @ref(fig:CPfidelityBoxplot) for an example. Here, the residuals for the neighbours are shifted towards higher values. This suggests that the model response is biased around the observation of interest.
TODO: diagnostic score: average quantile of neighbours.
(fig:CPfidelityBoxplot) Distribution of residuals for the whole validation data (grey boxplot) and for the 15 closest neighbours (red boxplot).
The Ceteris Paribus principle gives a uniform and extendable approach to model exploration. Below we summarize key strengths and weaknesses of this approach.
Pros
Cons
In this section we present key features of the ceterisParibus package for R (Biecek 2018b). This package covers all features presented in this chapter. It is available on CRAN and GitHub. Find more examples at the website of this package https://pbiecek.github.io/ceterisParibus/.
A very interesting tool for model exploration based on a similar principle is implemented in the condvis package (O'Connell, Hurley, and Domijan 2017).
Model preparation
In this section we will present examples based on the apartments dataset. See section TODO for more details.
library("DALEX")
head(apartments)
The problem here is to predict the average price per square meter of an apartment. Let's build a random forest model with the randomForest package (Breiman et al. 2018).
library("randomForest")
rf_model <- randomForest(m2.price ~ construction.year + surface + floor +
no.rooms, data = apartments)
rf_model
##
## Call:
## randomForest(formula = m2.price ~ construction.year + surface + floor + no.rooms, data = apartments)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 483086.5
## % Var explained: 41.18
Model exploration with the ceterisParibus package is performed in four steps.
1. Create an explainer - wrapper around model and validation data.
Since all other functions work in a model-agnostic fashion, first we need to define a wrapper around the model. Here we are using the explain() function from the DALEX package (Biecek 2018c).
library("DALEX")
explainer_rf <- explain(rf_model,
data = apartmentsTest, y = apartmentsTest$m2.price)
explainer_rf
## Model label: randomForest
## Model class: randomForest.formula,randomForest
## Data head :
## m2.price construction.year surface floor no.rooms district
## 1001 4644 1976 131 3 5 Srodmiescie
## 1002 3082 1978 112 9 4 Mokotow
2. Define point of interest.
Ceteris Paribus profiles explore the model around a single point.
new_apartment <- data.frame(construction.year = 1965, no.rooms = 5, surface = 142, floor = 8)
new_apartment
predict(rf_model, new_apartment)
## 1
## 2325.287
3. Calculate CP profiles
The ceteris_paribus() function calculates CP profiles for a selected model around a selected observation.
By default, CP profiles are calculated for all numerical variables. Use the variables argument to select a subset of interesting variables. The result of the ceteris_paribus() function is a data frame with model predictions for modified points around the point of interest.
library("ceterisParibus")
cp_rf <- ceteris_paribus(explainer_rf, new_apartment,
variables = c("construction.year", "floor"))
cp_rf
4. Plot CP profiles.
The generic plot() function plots CP profiles. It returns a ggplot2 object that can be polished if needed. Use additional arguments of this function to select colors and sizes of the elements visible in the plot.
plot(cp_rf)
A very useful feature of ceterisParibus explainers is that profiles for two or more models may be superimposed in a single plot. This helps in model comparisons.
Let’s create a linear model for this dataset and repeat steps 1-3 for the lm model.
lm_model <- lm(m2.price ~ construction.year + surface + floor +
no.rooms, data = apartments)
explainer_lm <- explain(lm_model,
data = apartmentsTest, y = apartmentsTest$m2.price)
cp_lm <- ceteris_paribus(explainer_lm, new_apartment,
variables = c("construction.year", "floor"))
Now we can use the plot() function to compare both models in a single chart. The additional argument color = "_label_" sets the model label as the color key.
plot(cp_rf, cp_lm, color = "_label_")
Oscillations
The calculate_oscillations() function calculates oscillations for CP profiles.
cp_rf_all <- ceteris_paribus(explainer_rf, new_apartment)
co_rf_all <- calculate_oscillations(cp_rf_all)
co_rf_all
plot(co_rf_all)
2D Ceteris Paribus profiles
And the what_if_2d() function calculates 2D CP profiles.
wi_rf_2d <- what_if_2d(explainer_rf, observation = new_apartment,
selected_variables = c("surface","floor", "construction.year"))
plot(wi_rf_2d, split_ncol = 2)
In this chapter we introduce the concept and the intuitions underlying "variable attribution," i.e., the decomposition of the difference between the single-instance and the average model predictions among the different explanatory variables. We can think about the following examples:
In each of those cases we want to attribute a part of the model prediction to a single explanatory variable. This can be done directly for linear models. Hence, in this chapter we focus on those models. The method can be easily extended to generalized linear models. Model-agnostic approaches will be presented in Chapters @ref(breakDown) and @ref(shapley).
Assume a classical linear model for response \(Y\) with \(p\) explanatory variables collected in the vector \(X = (X_1, X_2, \ldots, X_p)\) and coefficients \(\beta = (\beta_0, \beta_1, \ldots, \beta_p)\), where \(\beta_0\) is the intercept. The prediction for \(Y\) at point \(X=x=(x_1, x_2, \ldots, x_p)\) is given by the expected value of \(Y\) conditional on \(X=x\). For a linear model, the expected value is given by the following linear combination:
\[ E_Y(Y | x) = f(x) = \beta_0 + x_1 \beta_1 + \ldots + x_p \beta_p. \] We are interested in the contribution of the \(i\)-th explanatory variable to model prediction \(f(x^*)\) for a single observation described by \(x^*\). In this case, the contribution is equal to \(x^*_i\beta_i\), because the \(i\)-th variable occurs only in this term. As it will become clear in the sequel, it is easier to interpret the variable's contribution if \(x_i\) is centered by subtracting a constant \(\hat x_i\) (usually, the mean of \(x_i\)). This leads to the following, intuitive formula for the variable attribution: \[ v(f, x^*, i) = \beta_i (x_i^* - \hat x_i). \]
We want to calculate \(v(f, x^*, i)\), which is the contribution of the \(i\)-th explanatory variable to the prediction of model \(f()\) at point \(x^*\). Assume that \(E_Y(Y | x^*) \approx f(x^*)\), where \(f(x^*)\) is the value of the model at \(x^*\). A possible approach to define \(v(f, x^*, i)\) is to measure how much the expected model response changes after conditioning on \(x_i^*\): \[
v(f, x^*, i) = E_Y(Y | x^*) - E_{X_i}\{E_Y[Y | (x_1^*,\ldots,x_{i-1}^*,X_i,x_{i+1}^*,\ldots,x_p^*)]\}\approx f(x^*) - E_{X_i}[f(x_{-i}^*)],
\] where \(x_{-i}^*\) indicates that variable \(X_i\) in vector \(x_{-i}^*\) is treated as random. For the classical linear model, if the explanatory variables are independent, \(v(f, x^*, i)\) can be expressed as follows: \[
v(f, x^*, i) = f(x^*) - E_{X_i}[f(x_{-i}^*)] = \beta_0 + x_1^* \beta_1 + \ldots + x_p^* \beta_p - E_{X_i}[\beta_0 + x_1^* \beta_1 + \ldots + \beta_i X_i + \ldots + x_p^* \beta_p] = \beta_i[x^*_i - E_{X_i}(X_i)].
\] In practice, given a dataset, the expected value of \(X_i\) can be estimated by the sample mean \(\bar x_i\). This leads to
\[
v(f, x^*, i) = \beta_i (x^*_i - \bar x_i).
\] Note that the linear-model-based prediction may be re-expressed in the following way: \[
f(x^*) = [\beta_0 + \bar x_1 \beta_1 + ... + \bar x_p \beta_p] + [(x_1^* - \bar x_1) \beta_1 + ... + (x_p^* - \bar x_p) \beta_p] \equiv [average \ prediction] + \sum_{j=1}^p v(f, x^*, j).
\] Thus, the contributions of the explanatory variables sum up to the difference between the model prediction for \(x^*\) and the average prediction.
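These formulas can be verified by hand; below is a small sketch (our own helper, for numeric explanatory variables only) that reproduces what predict(..., type = "terms") returns for a fitted lm() object:

```r
# Sketch: attributions v(f, x*, i) = beta_i * (x*_i - mean(x_i)) for an
# lm() fit with numeric explanatory variables; linear_attributions is ours.
linear_attributions <- function(lm_fit, x_star) {
  beta  <- coef(lm_fit)[-1]                      # drop the intercept
  x_bar <- colMeans(model.matrix(lm_fit))[-1]    # training means of x_i
  x_new <- unlist(x_star[names(beta)])           # coordinates of x*
  beta * (x_new - x_bar)
}
```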
**NOTE for careful readers**
Obviously, sample mean \(\bar x_i\) is an estimator of the expected value \(E_{X_i}(X_i)\), calculated using a dataset. For the sake of simplicity we do not emphasize these differences in the notation. Also, we ignore the fact that, in practice, we never know the model coefficients and we work with an estimated model.
Also, we assumed that the explanatory variables are independent, which may not be the case. We will return to this problem in Section [TOMASZ: INSERT REF], when we discuss interactions.
Figure @ref(fig:attribution1) shows the relation between alcohol and wine quality, based on the wine dataset (Cortez et al. 2009). The linear model is \[ quality(alcohol) = 2.5820 + 0.3135 \cdot alcohol. \] The weakest wine in the dataset has 8% of alcohol, while the average alcohol concentration is 10.51%. Thus, the contribution of alcohol to the model prediction for the weakest wine is \(0.3135 \cdot (8-10.51) = -0.786885\). This means that the low concentration of alcohol for this wine (8%) decreases the predicted quality by 0.786885.
Note that it would be misleading to use \(x_i^*\beta_i = 0.3135 \cdot 8 = 2.508\) as the alcohol contribution to the quality. The positive value of the product would not correspond to the intuition that, in the presence of a positive relation, a smaller alcohol concentration should imply a lower predicted quality of the wine.
(fig:attribution1) Relation between wine quality and concentration of alcohol assessed with a linear model
The proposed definition of \(v(f, x^*, i)\) is linked neither with the scale nor the location of \(X_i\); hence, it is easier to understand than, for instance, the standardized value of \(\beta_i\). [TOMASZ: I AM NOT SURE ABOUT THIS ARGUMENT. THE DEFINITION OF v() INVOLVES CENTERING.] For the classical linear model, \(v(f, x^*, i)\) is not an approximation and it is directly linked with the structure of the model. An obvious disadvantage is that the definition of \(v(f, x^*, i)\) is very much linear-model based. Also, it does not, in any way, reduce the model complexity; it only presents model coefficients in a different way. [TOMASZ: I AM NOT SURE ABOUT THIS ARGUMENT. MODEL COMPLEXITY IS WHAT IT IS. HOW CAN WE REDUCE IT?]
In this section, we present an example of computing variable attributions using the HR dataset (see Section @ref(HRdataset) for more details).
To calculate variable attributions for a particular point, first we have to define this point:
new_observation <- data.frame(gender = factor("male", levels = c("male", "female")),
age = 57.7,
hours = 42.3,
evaluation = 2,
salary = 2)
Variable attributions for linear and generalized linear models may be directly extracted by applying the predict() function, with the argument type = "terms", to an object containing the results of fitting a model. To illustrate the approach for logistic regression, we build a logistic regression model for the binary variable status == "fired" and extract the estimated model coefficients:
library("DALEX")
model_fired <- glm(status == "fired" ~ ., data = HR, family = "binomial")
coef(model_fired)
## (Intercept) gendermale age hours evaluation
## 5.737945729 -0.066803609 -0.001503314 -0.102021120 -0.425793369
## salary
## -0.015740080
For the new observation, the predicted value of the logit of the probability of being fired is obtained by applying the predict() function:
as.vector(pred.inst <- predict(model_fired, new_observation)) # new prediction
## [1] 0.3858406
On the other hand, variable attributions can be obtained by applying the predict() function with the type="terms" argument:
(var.attr <- predict(model_fired, new_observation, type = "terms")) # attributions
## gender age hours evaluation salary
## 1 -0.03361889 -0.02660691 0.7555555 0.5547197 0.007287334
## attr(,"constant")
## [1] -0.8714962
The largest contributions to the prediction come from variables "hours" and "evaluation." Variables "gender" and "age" slightly decrease the predicted value. The sum of the attributions is equal to
sum(var.attr)
## [1] 1.257337
The attribute constant of object var.attr provides the "average" prediction, i.e., the predicted logit for an observation defined by the means of the explanatory variables, as can be seen from the calculation below:
coef(model_fired) %*% c(1, mean(HR$gender == "male"), mean(HR$age), mean(HR$hours),
                        mean(HR$evaluation), mean(HR$salary))
## [,1]
## [1,] -0.8714962
Adding the "average" prediction to the sum of the variable attributions results in the new-observation prediction:
attributes(var.attr)$constant + sum(var.attr)
## [1] 0.3858406
Below we illustrate how to implement this approach with the DALEX package. Toward this end, functions explain() and single_prediction() can be used. Object model_fired stores the definition of the logistic-regression model used earlier in this section. The contents of object attribution correspond to the results obtained by using function predict().
library("DALEX")
explainer_fired <- explain(model_fired,
data = HR,
y = HR$status == "fired",
label = "fired")
(attribution <- single_prediction(explainer_fired, new_observation))
After object attribution has been created, a plot presenting the variable attributions can be easily constructed:
plot(attribution)
In Section @ref(variableAttributionMethods) we introduced a method for the calculation of variable attributions for linear models. This method is accurate, being based directly on the structure of the model. But for the most popular machine learning models we cannot assume that they are linear, or even additive.
For any model we may repeat the intuition presented in Section @ref(variableAttributionMethods) to calculate variable contributions as shifts in the expected model response after conditioning on consecutive variables. This intuition is presented in Figure @ref(fig:BDPrice4).
Panel A shows the distribution of model responses. The row all data shows the model responses for the original dataset. The red dot stands for the average, which is an estimate of the expected model response \(E [f(x)]\).
Since we want to calculate the effects of particular values of selected variables, we condition on these variables in a sequential manner. The next row in panel A corresponds to the average model prediction for observations with the variable surface fixed to the value 35. The next row corresponds to the average model prediction with the variables surface set to 35 and floor set to 1, and so on. The top row corresponds to the model response for \(x^*\).
Black lines in panel A show how the prediction for a single point changes after coordinate \(i\) is replaced by \(x^*_i\). But ultimately we are interested neither in particular changes nor in distributions, but only in averages, i.e., expected model responses.
The minimal form that shows the important information is presented in panel C. Positive differences are presented with green bars, while negative differences are marked with yellow bars. They sum up to the final model prediction, which is denoted by a grey bar in this example.
(fig:BDPrice4) Break Down Plots show how variables move the model prediction from the population average to the model prediction for a single observation. A) The last row shows the distribution of model predictions. Subsequent rows show conditional distributions; in every row a new variable is added to the conditioning. The first row shows the model prediction for a single point. Red dots stand for averages. B) Blue arrows show how the average conditional response changes; these values are the variable contributions. C) Only the variable contributions are presented.
Again, let \(v(f, x^*, i)\) stand for the contribution of variable \(x_i\) to the prediction of model \(f()\) at point \(x^*\).
We expect that such contributions will sum up to the model prediction at a given point (a property called local accuracy), so \[ f(x^*) = baseline + \sum_{i=1}^p v(f, x^*, i) \] where \(baseline\) stands for the average model response.
Note that the equation above may be rewritten as
\[ E [f(X)|X_1 = x_1^*, \ldots, X_p = x_p^*] = E[f(X)] + \sum_{i=1}^p v(f, x^*, i) \] which leads to a quite natural proposition for \(v(f, x^*, i)\), such as
\[ v(f, x^*, i) = E [f(X) | X_1 = x_1^*, \ldots, X_i = x_i^*] - E [f(X) | X_1 = x_1^*, \ldots, X_{i-1} = x_{i-1}^*] \] In other words, the contribution of variable \(i\) is the difference between the expected model response conditioned on the first \(i\) variables and the expected model response conditioned on the first \(i-1\) variables.
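The conditional expectations above can be estimated in a model-agnostic way by fixing selected coordinates in every row of the dataset and averaging the model predictions; a minimal sketch with our own naming, not a package API:

```r
# Sketch: estimate E[f(X) | X_v = x*_v for v in fixed_vars] by substituting
# the values of x* into the dataset and averaging the model predictions.
# predict_fun must return numeric predictions.
expected_conditional <- function(model, data, x_star, fixed_vars,
                                 predict_fun = function(m, d) predict(m, d)) {
  for (v in fixed_vars) data[[v]] <- x_star[[v]]
  mean(predict_fun(model, data))
}
```

The contribution of variable \(i\) is then the difference between two such calls.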
Such a proposition fulfills the local accuracy condition, but, unfortunately, the variable contributions depend on the ordering of the variables.
(fig:ordering) Two different paths between the average model prediction and the model prediction for a selected observation. Black dots stand for conditional averages, red arrows stand for changes between conditional averages.
See, for example, Figure @ref(fig:ordering). In the first ordering, the contribution of the variable age is calculated as 0.01, while in the second the contribution is calculated as 0.13. Such differences are related to the lack of additivity of the model \(f()\). The approaches presented in the next two sections offer different solutions to this problem.
The approach to variable attribution presented in Section @ref(modelAgnosticAttribution) has the property of local accuracy, but the variable contributions depend on the variable ordering.
The easiest way to solve this problem is to use a two-step procedure. In the first step, variables are ordered, and in the second step, consecutive conditioning is applied to the ordered variables.
The first step of this algorithm is to determine the order of variables for conditioning. It seems reasonable to condition first on the variables that are likely to be most important, leaving the noise variables for the end. This leads to an ordering based on the following scores:
\[ score(f, x^*, i) = \left| E [f(X)] - E [f(X)|X_i = x^*_i] \right| \] Note, that the absolute value is needed as variable contributions can be both positive and negative.
Once the ordering is determined, in the second step the variable contributions are calculated as
\[ v(f, x^*, i) = E [f(X) | X_{I \cup \{i\}} = x_{I \cup \{i\}}^*] - E [f(X) | X_{I} = x_{I}^*] \] where \(I\) is the set of variables that have scores larger than the score for variable \(i\), i.e., the variables conditioned on before variable \(i\):
\[ I = \{j: score(f, x^*, j) > score(f, x^*, i)\} \]
The time complexity of the first step is \(O(p)\), where \(p\) is the number of variables, and the time complexity of the second step is also \(O(p)\).
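Putting both steps together, the whole procedure may be sketched as follows, reusing the expected_conditional() helper sketched earlier (all names are ours, not the breakDown2 API):

```r
# Sketch of the two-step Break Down procedure for additive attributions.
break_down_sketch <- function(model, data, x_star,
                              predict_fun = function(m, d) predict(m, d)) {
  baseline <- mean(predict_fun(model, data))
  vars <- colnames(x_star)
  # Step 1: order variables by |E[f(X) | X_i = x*_i] - E[f(X)]|.
  scores <- sapply(vars, function(v)
    abs(expected_conditional(model, data, x_star, v, predict_fun) - baseline))
  ordered <- names(sort(scores, decreasing = TRUE))
  # Step 2: sequential conditioning along the determined ordering.
  cumulative <- c(baseline, sapply(seq_along(ordered), function(i)
    expected_conditional(model, data, x_star, ordered[seq_len(i)], predict_fun)))
  data.frame(variable = c("(Intercept)", ordered),
             cumulative = cumulative,
             contribution = c(baseline, diff(cumulative)))
}
```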
Let us consider a random forest model created for the HR data. The average model response is \(\bar f(x) = 0.385586\). For a selected observation \(x^*\), the table below presents the scores for particular variables.
| variable | \(E[f(X) \vert X_i = x^*_i]\) | \(score_i\) |
|---|---|---|
| hours | 0.616200 | 0.230614 |
| salary | 0.225528 | 0.160058 |
| evaluation | 0.430994 | 0.045408 |
| age | 0.364258 | 0.021328 |
| gender | 0.391060 | 0.005474 |
Once we determine the order, we can calculate the sequential contributions:
| variable | cumulative | contribution |
|---|---|---|
| (Intercept) | 0.385586 | 0.385586 |
| * hours = 42 | 0.616200 | 0.230614 |
| * salary = 2 | 0.400206 | -0.215994 |
| * evaluation = 2 | 0.405776 | 0.005570 |
| * age = 58 | 0.497314 | 0.091538 |
| * gender = male | 0.778000 | 0.280686 |
| final_prognosis | 0.778000 | 0.778000 |
The Break Down approach is model-agnostic and can be applied to any predictive model that returns a single number. It leads to additive variable attributions. Below we summarize key strengths and weaknesses of this approach.
Pros
Cons
In this section we present key features of the breakDown2 package for R (Biecek 2018a). This package covers all features presented in this chapter. It is available on CRAN and GitHub. Find more examples at the website of this package https://pbiecek.github.io/breakDown2/.
Model preparation
In this section we will present an example based on the HR dataset and a Random Forest model (Breiman et al. 2018). See Section @ref(HRdataset) for more details.
library("DALEX2")
library("randomForest")
model <- randomForest(status ~ gender + age + hours + evaluation + salary, data = HR)
model
##
## Call:
## randomForest(formula = status ~ gender + age + hours + evaluation + salary, data = HR)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 27.36%
## Confusion matrix:
## fired ok promoted class.error
## fired 2272 390 193 0.2042032
## ok 537 1251 433 0.4367402
## promoted 200 394 2177 0.2143630
Model exploration with the breakDown2 package is performed in three steps.
1. Create an explainer - wrapper around model and validation data.
Since all other functions work in a model-agnostic fashion, first we need to define a wrapper around the model. Here we are using the explain() function from the DALEX2 package (Biecek 2018c).
explainer_rf <- explain(model,
data = HR,
y = HR$status)
2. Select an observation of interest.
Break Down Plots decompose model prediction around a single observation. Let’s construct a data frame with corresponding values.
new_observation <- data.frame(gender = factor("male", levels = c("male", "female")),
age = 57.7,
hours = 42.3,
evaluation = 2,
salary = 2)
predict(model, new_observation, type = "prob")
## fired ok promoted
## 1 0.79 0.202 0.008
## attr(,"class")
## [1] "matrix" "votes"
3. Calculate Break Down decomposition
The local_attributions() function calculates Break Down contributions for a selected model around a selected observation.
The result of the local_attributions() function is a data frame with variable attributions.
library("breakDown2")
bd_rf <- local_attributions(explainer_rf,
new_observation,
keep_distributions = TRUE)
bd_rf
## contribution
## randomForest.fired: baseline 0.377
## randomForest.fired: * hours = 42 0.238
## randomForest.fired: * evaluation = 2 0.064
## randomForest.fired: * salary = 2 -0.243
## randomForest.fired: * age = 58 0.075
## randomForest.fired: * gender = male 0.280
## randomForest.fired: prediction 0.790
## randomForest.ok: baseline 0.274
## randomForest.ok: * hours = 42 -0.050
## randomForest.ok: * evaluation = 2 0.092
## randomForest.ok: * salary = 2 0.241
## randomForest.ok: * age = 58 -0.072
## randomForest.ok: * gender = male -0.283
## randomForest.ok: prediction 0.202
## randomForest.promoted: baseline 0.349
## randomForest.promoted: * hours = 42 -0.188
## randomForest.promoted: * evaluation = 2 -0.155
## randomForest.promoted: * salary = 2 0.002
## randomForest.promoted: * age = 58 -0.004
## randomForest.promoted: * gender = male 0.003
## randomForest.promoted: prediction 0.008
## baseline: 0
The generic plot() function creates Break Down plots.
plot(bd_rf)
Add the plot_distributions = TRUE argument to enrich the plot with additional information about conditional distributions.
plot(bd_rf, plot_distributions = TRUE)
In Section @ref(breakDown) we presented a model-agnostic approach to the additive decomposition of a model prediction for a single observation.
For non-additive models, the variable contributions depend on the values of other variables.
In this section we present an algorithm that identifies interactions between pairs of variables and includes such interactions in variable decomposition plots. We present the algorithm for pairs of variables, but it can be easily generalized to a larger number of variables.
The key idea here is to identify interactions between variables. This can be done in two steps.
TODO: easy example for interaction
This algorithm is also composed of two steps. In the first step, variables and pairs of variables are ordered in terms of their importance, while in the second step, consecutive conditioning is applied to the ordered variables.
To determine the importance of variables and pairs of variables, the following scores are calculated.
For a single variable
\[ score_1(f, x^*, i) = \left| E [f(X)|X_i = x^*_i] - E [f(X)]\right| \] For pairs of variables
\[ score_2(f, x^*, (i,j)) = \left| E [f(X)|X_i = x^*_i, X_j = x^*_j] - E [f(X)|X_i = x^*_i] - E [f(X)| X_j = x^*_j] + E [f(X)] \right| \] Note that, up to the signs of the differences hidden inside the absolute values of \(score_1\), this is equivalent to
\[ score_2(f, x^*, (i,j)) = \left| E [f(X)|X_i = x^*_i, X_j = x^*_j] - score_1 (f, x^*, i) - score_1 (f, x^*, j) + baseline \right| \] In other words, \(score_1(f, x^*, i)\) measures how much the average model response changes if variable \(x_i\) is set to \(x_i^*\), which is an index of local variable importance. On the other hand, \(score_2(f, x^*, (i,j))\) measures how much this change differs from the additive composition of the changes for \(x_i\) and \(x_j\), which is an index of local interaction importance.
Note that, for additive models, \(score_2(f, x^*, (i,j))\) should be close to zero. Thus, the larger this value, the larger the deviation from additivity.
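The interaction score can again be estimated with the expected_conditional() helper sketched in Section @ref(modelAgnosticAttribution); the function below is our own illustration, not the breakDown2 API:

```r
# Sketch: local interaction score for a pair of variables (i, j).
score_2 <- function(model, data, x_star, i, j,
                    predict_fun = function(m, d) predict(m, d)) {
  e0  <- mean(predict_fun(model, data))                              # E[f(X)]
  ei  <- expected_conditional(model, data, x_star, i,       predict_fun)
  ej  <- expected_conditional(model, data, x_star, j,       predict_fun)
  eij <- expected_conditional(model, data, x_star, c(i, j), predict_fun)
  abs(eij - ei - ej + e0)
}
```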
The second step of the algorithm is the sequential conditioning. In this version, in every step we condition on a single variable or a pair of variables, in the order determined by \(score_1\) and \(score_2\).
The complexity of the first step is \(O(p^2)\), where \(p\) stands for the number of variables. The complexity of the second step is \(O(p)\).
Again, let us consider the HR dataset. The table below shows \(score_1\) and \(score_2\) calculated for consecutive variables and pairs of variables.
| variable (or pair) | conditional expectation | \(score_1\) or \(score_2\) |
|---|---|---|
| hours | 0.616200 | 0.230614 |
| salary | 0.225528 | -0.160058 |
| age:gender | 0.516392 | 0.146660 |
| salary:age | 0.266226 | 0.062026 |
| salary:hours | 0.400206 | -0.055936 |
| evaluation | 0.430994 | 0.045408 |
| hours:age | 0.635662 | 0.040790 |
| salary:evaluation | 0.238126 | -0.032810 |
| age | 0.364258 | -0.021328 |
| evaluation:hours | 0.677798 | 0.016190 |
| salary:gender | 0.223292 | -0.007710 |
| evaluation:age | 0.415688 | 0.006022 |
| gender | 0.391060 | 0.005474 |
| hours:gender | 0.626478 | 0.004804 |
| evaluation:gender | 0.433814 | -0.002654 |
Once we have determined the order, we can calculate the sequential conditionings. In the first step we condition on the variable hours, then on salary. The third position is occupied by the age:gender interaction, thus we add both variables to the conditioning.
| variable | cumulative | contribution |
|---|---|---|
| (Intercept) | 0.385586 | 0.385586 |
| * hours = 42 | 0.616200 | 0.230614 |
| * salary = 2 | 0.400206 | -0.215994 |
| * age:gender = 58:male | 0.796856 | 0.396650 |
| * evaluation = 2 | 0.778000 | -0.018856 |
| final_prognosis | 0.778000 | 0.778000 |
Break Down Plots for interactions are similar in structure to plots for single variables. The only difference is that a pair of variables may be listed in a single row. See an example in Figure @ref(fig:bdInter1).
(fig:bdInter1) Break Down Plot for variable attribution with interactions
Break Down for interactions shares many features of Break Down for single variables. Below we summarize unique strengths and weaknesses of this approach.
Pros
Cons
The algorithm for Break Down for interactions is also implemented in the local_interactions() function from the breakDown2 package.
Model preparation
First a model needs to be trained.
library("DALEX2")
library("randomForest")
model <- randomForest(status ~ gender + age + hours + evaluation + salary, data = HR)
model
##
## Call:
## randomForest(formula = status ~ gender + age + hours + evaluation + salary, data = HR)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 27.34%
## Confusion matrix:
## fired ok promoted class.error
## fired 2277 389 189 0.2024518
## ok 527 1248 446 0.4380910
## promoted 201 393 2177 0.2143630
Model exploration with the breakDown2 package is performed in three steps.
1. Create an explainer, a wrapper around the model and validation data.
Since all other functions work in a model-agnostic fashion, first we need to define a wrapper around the model. Here we are using the explain() function from the DALEX2 package.
explainer_rf <- explain(model,
data = HR,
y = HR$status)
2. Select an observation of interest.
Break Down Plots decompose model prediction around a single observation. Let’s construct a data frame with corresponding values.
new_observation <- data.frame(gender = factor("male", levels = c("male", "female")),
age = 57.7,
hours = 42.3,
evaluation = 2,
salary = 2)
predict(model, new_observation, type = "prob")
## fired ok promoted
## 1 0.808 0.19 0.002
## attr(,"class")
## [1] "matrix" "votes"
3. Calculate the Break Down decomposition.
The local_interactions() function calculates Break Down contributions for a selected model around a selected observation. The result is a data frame with variable attributions.
library("breakDown2")
bd_rf <- local_interactions(explainer_rf,
new_observation)
bd_rf
## contribution
## randomForest.fired: baseline 0.376
## randomForest.fired: * hours = 42 0.239
## randomForest.fired: * salary = 2 -0.212
## randomForest.fired: * age:gender = 58:male 0.394
## randomForest.fired: * evaluation = 2 0.011
## randomForest.fired: prediction 0.808
## randomForest.ok: baseline 0.276
## randomForest.ok: * evaluation = 2 0.136
## randomForest.ok: * salary = 2 0.162
## randomForest.ok: * age:gender = 58:male -0.286
## randomForest.ok: * hours = 42 -0.097
## randomForest.ok: prediction 0.190
## randomForest.promoted: baseline 0.348
## randomForest.promoted: * hours = 42 -0.189
## randomForest.promoted: * evaluation = 2 -0.153
## randomForest.promoted: * age:gender = 58:male -0.005
## randomForest.promoted: * salary = 2 0.001
## randomForest.promoted: prediction 0.002
## baseline: 0
The generic plot() function creates Break Down plots.
plot(bd_rf)
In Section @ref(modelAgnosticAttribution) we showed the problem related to the ordering of variables. In Section @ref(breakDown) we showed an approach in which the ordering was determined by a single-step assessment of variable importance.
In this section we introduce another very popular approach to additive variable attribution. The problem of contributions that depend on the variable ordering is solved by averaging contributions over all possible orderings.
This method is motivated by results from cooperative game theory and was first introduced in (Štrumbelj and Kononenko 2014). Wide adoption came with the NIPS 2017 paper (Lundberg and Lee 2017) and the Python library SHAP https://github.com/slundberg/shap. The authors of the SHAP method also introduced an efficient algorithm for tree-based models, see (Lundberg, Erion, and Lee 2018).
In sequential attribution the effect of a variable depends on the ordering; here the idea is to average contributions across all possible orderings.
As an easy example, consider a model with two variables, \(x_1\) and \(x_2\). There are two orderings: in the first one, the contribution of \(x_1\) is \(E[f(X)|X_1 = x^*_1] - E[f(X)]\); in the second one, it is \(E[f(X)|X_1 = x^*_1, X_2 = x^*_2] - E[f(X)|X_2 = x^*_2]\). The Shapley value for \(x_1\) is the average of these two contributions.
The name Shapley Values comes from a solution in cooperative game theory attributed to Lloyd Shapley. The original problem was to assess how important each player is to the overall cooperation and what payoff each of them can reasonably expect from the coalition (Shapley 1953).
In the context of model interpretability, the payoff is the average model response, while the players are the variables in the conditioning. The formula for variable contributions is as follows.
\[ v(f, x^*, i) = \frac 1{|P|}\sum_{S \subseteq P\setminus \{i\}} {{|P|-1}\choose{|S|}}^{-1} \left(E [f(X) | X_{S \cup \{i\}} = x^*_{S \cup \{i\}}] - E [f(X) | X_{S} = x^*_{S}]\right) \] where \(P = \{1, \ldots, p\}\) is the set of all variables. The intuition behind this contribution is the following: we consider all possible orderings of variables (yes, there are \(p!\) of them) and calculate the contribution of variable \(i\) as the average of its contributions in particular orderings. The formula above groups the orderings by the set \(S\) of variables introduced before variable \(i\).
The part \(E[f(X) | X_{S \cup \{i\}} = x^*_{S \cup \{i\}}] - E [f(X) | X_{S} = x^*_{S}]\) is the contribution of variable \(i\) when it is introduced after the variables from \(S\).
The time complexity of this method is \(O(2^p)\), where \(p\) stands for the number of variables, as the sum goes over all subsets of variables. Such complexity makes this method impractical in most cases. Fortunately, it is enough to estimate this value: (Štrumbelj and Kononenko 2014) proposed to use sampling, while (Lundberg, Erion, and Lee 2018) proposed fast exact implementations for tree-based ensembles.
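For a small number of variables, Shapley values can be computed exactly by enumerating all subsets. Below is a minimal, exponential-time sketch for illustration only (not the implementation of shapper or iml); it reuses the hypothetical expected_response() helper sketched in the chapter on interactions.
# Exact Shapley value for variable i; enumerates all 2^(p-1) subsets S.
shapley_value <- function(model, data, x_star, i, vars, predict_fun) {
  others <- setdiff(vars, i)
  p <- length(vars)
  # all subsets S of the remaining variables, starting from the empty set
  subsets <- list(character(0))
  for (k in seq_along(others))
    subsets <- c(subsets, combn(others, k, simplify = FALSE))
  baseline <- mean(predict_fun(model, data))
  phi <- 0
  for (S in subsets) {
    # Shapley weight: |S|! (p - |S| - 1)! / p! = 1 / (p * choose(p - 1, |S|))
    weight <- 1 / (p * choose(p - 1, length(S)))
    v_with <- expected_response(model, data, x_star, c(S, i), predict_fun)
    v_without <- if (length(S) == 0) baseline else
      expected_response(model, data, x_star, S, predict_fun)
    phi <- phi + weight * (v_with - v_without)
  }
  phi
}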
Properties
Shapley values are the single unique solution satisfying the following properties: efficiency (the contributions sum up to the difference between the model prediction and the baseline), symmetry (variables with identical effects get equal contributions), dummy (a variable that does not change predictions gets a zero contribution), and additivity.
Shapley Values give a uniform approach to decomposing a model prediction into parts that can be attributed additively to variables. Below we summarize the key strengths and weaknesses of this approach.
Pros
Cons
Note that, for fully additive models, the solutions presented in Sections @ref(modelAgnosticAttribution), @ref(breakDown) and @ref(shapley) lead to the same variable contributions.
In this section we present an example based on the HR dataset and a Random Forest model (Breiman et al. 2018). See Section @ref(HRdataset) for more details.
library("DALEX")
library("randomForest")
set.seed(123)
model_rf <- randomForest(status ~ gender + age + hours + evaluation + salary, data = HR)
First, we use the shapper package, a wrapper over the Python SHAP package.
library("shapper")
Y_train <- HR$status
x_train <- HR[ , -6]
x_train$gender <- as.numeric(x_train$gender)
model_rfs <- randomForest(x = x_train, y = Y_train)
p_fun <- function(x, data){
predict(x, newdata = data, type = "prob")
}
new_observation <- data.frame(gender = 1,
age = 57.7,
hours = 42.3,
evaluation = 2,
salary = 2)
x <- individual_variable_effect(x = model_rfs, data = x_train, predict_function = p_fun,
new_observation = new_observation)
#plot(x)
library("ggplot2")
x$`_vname_` <- reorder(x$`_vname_`, x$`_attribution_`, function(z) -sum(abs(z)))
levels(x$`_vname_`) <- paste(sapply(1:6, substr, x=" ", start=1), levels(x$`_vname_`))
ggplot(x, aes(x=`_vname_`, xend=`_vname_`,
yend = `_yhat_mean_`, y = `_yhat_mean_` + `_attribution_`,
color=`_sign_`)) +
geom_segment(arrow = arrow(length=unit(0.30,"cm"), ends="first", type = "closed")) +
geom_text(aes(label = round(`_attribution_`, 2)), nudge_x = 0.45) +
geom_segment(aes(x = "_predicted_",xend = "_predicted_",
y = `_yhat_`, yend = `_yhat_mean_`), size = 2, color="black",
arrow = arrow(length=unit(0.30,"cm"), ends="first", type = "closed")) +
geom_text(aes(x = "_predicted_",
y = `_yhat_`, label = round(`_yhat_`, 2)), nudge_x = 0.45, color="black") +
geom_hline(aes(yintercept = `_yhat_mean_`)) +
facet_grid(`_label_`~`_ylevel_`) +
scale_color_manual(values = c(`-` = "#d8b365", `0` = "#f5f5f5", `+` = "#5ab4ac",
X = "darkgrey")) +
coord_flip() + theme_minimal() + theme(legend.position="none") + xlab("") + ylab("")
Here we use the iml package, see more examples in (Molnar, Bischl, and Casalicchio 2018).
library("iml")
explainer_rf = Predictor$new(model_rf, data = HR, type="prob")
Explanations for a new observation.
new_observation <- data.frame(gender = factor("male", levels = c("male", "female")),
age = 57.7,
hours = 42.3,
evaluation = 2,
salary = 2,
status = factor("fired"))
shapley = Shapley$new(explainer_rf, x.interest = new_observation)
shapley
## Interpretation method: Shapley
## Predicted value: 0.768000, Average prediction: 0.375522 (diff = 0.392478) Predicted value: 0.232000, Average prediction: 0.275811 (diff = -0.043811) Predicted value: 0.000000, Average prediction: 0.348667 (diff = -0.348667)
##
## Analysed predictor:
## Prediction task: unknown
##
##
## Analysed data:
## Sampling from data.frame with 7847 rows and 6 columns.
##
## Head of results:
## feature class phi phi.var feature.value
## 1 gender fired 0.15184 0.06525201 gender=male
## 2 age fired 0.10980 0.07939374 age=57.7
## 3 hours fired 0.23736 0.11281486 hours=42.3
## 4 evaluation fired 0.03640 0.01645164 evaluation=2
## 5 salary fired -0.17504 0.04931632 salary=2
## 6 status fired 0.00000 0.00000000 status=fired
And the plot with Shapley attributions.
plot(shapley)
See more examples for iml package in the (Molnar 2018b) book.
A different approach to explaining a single observation is through surrogate models, i.e. models that are easy to understand and that behave similarly to the black-box model around the point of interest.
The variable attribution methods presented in Section @ref(breakDown) do not use the local curvature of the model. They compare the model prediction against the average model prediction and use the probability structure of the dataset.
A complementary approach is to directly explore information about the model curvature around the point of interest. In Section @ref(ceterisParibus) we introduced the Ceteris Paribus tool for such what-if analysis. But the limitation of Ceteris Paribus plots is that they explore changes along a single dimension or pairs of dimensions.
In this section we describe another approach, based on local approximations with white-box models. This approach also investigates the local curvature of the model, but indirectly, through surrogate white-box models.
The best known method from this class is LIME (Local Interpretable Model-Agnostic Explanations), introduced in the paper Why Should I Trust You?: Explaining the Predictions of Any Classifier (Ribeiro, Singh, and Guestrin 2016). This method and its clones are now implemented in various R and Python packages, see for example (Pedersen and Benesty 2018), (Staniak and Biecek 2018) or (Molnar 2018a).
The LIME method, and its clones, has the following properties: it is model-agnostic, it operates on an interpretable data representation, and the resulting explanation is locally faithful to the black-box model.
Therefore, the objective is to find a local model \(M^L\) that approximates the black-box model \(f\) around the point \(x^*\). As a solution, a penalized loss function is used. The white-box model used for explanations satisfies the following condition:
\[ M^*(x^*) = \arg \min_{g \in G} L(f, g, \Pi_{x^*}) + \Omega (g) \] where \(G\) is a family of white-box models (e.g. linear models), \(\Pi_{x^*}\) is the neighbourhood of \(x^*\), and \(\Omega(g)\) stands for the complexity of the model \(g\).
(fig:LIME1) A schematic idea behind local model approximations. Panel A shows the training data; colors correspond to classes. Panel B shows results from the Random Forest model, which is where the algorithm starts. Panel C shows new data sampled around the point of interest; their colors correspond to the model response. Panel D shows the fitted linear model that approximates the random forest model around the point of interest.
The algorithm is composed of three steps:
Identification of an interpretable data representation
For image data, a single pixel is not an interpretable feature. In this step the input space of the model is transformed into a space that is easier for a human to understand. For example, an image may be decomposed into parts and represented as the presence/absence of these parts.
Local sampling around the point of interest
Once the interpretable data representation is identified, the neighbourhood around the point of interest needs to be explored.
Fitting a white-box model in this neighbourhood
Any model that is easy to interpret may be fitted to this data, like a decision tree or a rule-based system. However, in practice the most common family of models is linear models. A minimal sketch of all three steps is presented below.
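To make these steps concrete, here is a minimal sketch in R (our illustration, not the implementation of the lime package). We assume a hypothetical function f() that returns the black-box prediction (e.g. the probability of the class fired) and an observation of interest x_star; step 1 is trivial here, since for tabular data the features are already interpretable.
set.seed(1)
n <- 500
# Step 2: sample a neighbourhood around the point of interest
neighbourhood <- data.frame(
  gender     = x_star$gender,                              # kept fixed
  age        = x_star$age   + rnorm(n, sd = sd(HR$age)),
  hours      = x_star$hours + rnorm(n, sd = sd(HR$hours)),
  evaluation = sample(HR$evaluation, n, replace = TRUE),
  salary     = sample(HR$salary, n, replace = TRUE))
# weights decrease with the distance from the point of interest
dist2 <- ((neighbourhood$age - x_star$age) / sd(HR$age))^2 +
         ((neighbourhood$hours - x_star$hours) / sd(HR$hours))^2
weights <- exp(-dist2)
# Step 3: fit a weighted white-box (here: linear) model to black-box predictions
y_hat <- f(neighbourhood)
white_box <- lm(y_hat ~ age + hours + evaluation + salary,
                data = neighbourhood, weights = weights)
coef(white_box)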
Local approximations are model agnostic, can be applied to any predictive model. Below we summarize key strengths and weaknesses of this approach.
Pros
Cons
In this section we present example applications of the lime (Pedersen and Benesty 2018) and live (Staniak and Biecek 2018) packages. Note that this method is also implemented in iml (Molnar 2018a) and other packages. These packages differ in some details, which may result in different explanations.
Model preparation
In this section we will present examples based on the HR dataset. See the Section @ref(HRdataset) for more details.
library("DALEX")
head(HR)
The problem here is to predict the status of an employee (fired, ok or promoted). Let's build a random forest model with the randomForest package (Breiman et al. 2018).
library("randomForest")
rf_model <- randomForest(status ~ gender + age + hours + evaluation + salary, data = HR)
rf_model
##
## Call:
## randomForest(formula = status ~ gender + age + hours + evaluation + salary, data = HR)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 27.4%
## Confusion matrix:
## fired ok promoted class.error
## fired 2274 380 201 0.2035026
## ok 539 1235 447 0.4439442
## promoted 199 384 2188 0.2103934
new_observation <- data.frame(gender = factor("male", levels = c("male", "female")),
age = 57.7,
hours = 42.3,
evaluation = 2,
salary = 2)
predict(rf_model, new_observation, type = "prob")
## fired ok promoted
## 1 0.806 0.192 0.002
## attr(,"class")
## [1] "matrix" "votes"
library("lime")
# tell lime that this model is a classifier, by providing a model_type() method
model_type.randomForest <- function(x, ...) "classification"
lime_rf <- lime(HR[,1:5], rf_model)
explanations <- lime::explain(new_observation[,1:5], lime_rf, n_labels = 3, n_features = 3)
explanations
plot_features(explanations)
library("live")
new_observation$status <- "fired"
explainer_rf_fired <- explain(rf_model,
data = HR,
y = HR$status == "fired",
predict_function = function(m,x) predict(m,x, type = "prob")[,1],
label = "fired")
local_model <- local_approximation(explainer_rf_fired, new_observation,
target_variable_name = "status", n_new_obs = 500)
local_model
## Dataset:
## Observations: 500
## Variables: 6
## Response variable: status
## Explanation model:
## Name: regr.lm
## Variable selection wasn't performed
## Weights present in the explanation model
## R-squared: 0.7339
plot(local_model)
plot(local_model, type = "forest")
library("iml")
explainer_rf = Predictor$new(rf_model, data = HR[,1:5])
white_box = LocalModel$new(explainer_rf, x.interest = new_observation[,1:5], k = 5)
white_box
## Interpretation method: LocalModel
##
##
## Analysed predictor:
## Prediction task: unknown
##
##
## Analysed data:
## Sampling from data.frame with 7847 rows and 5 columns.
##
## Head of results:
## beta x.recoded effect x.original feature feature.value
## 1 0.070451023 1.0 0.07045102 male gender=male gender=male
## 2 0.005839379 57.7 0.33693215 57.7 age age=57.7
## 3 -0.090103309 42.3 -3.81136995 42.3 hours hours=42.3
## 4 -0.508252954 2.0 -1.01650591 2 evaluation evaluation=2
## 5 -0.032191570 2.0 -0.06438314 2 salary salary=2
## 6 -0.018323731 1.0 -0.01832373 male gender=male gender=male
## .class
## 1 fired
## 2 fired
## 3 fired
## 4 fired
## 5 fired
## 6 ok
plot(white_box)
TODO
compare pros and cons of different techniques
There are several use-cases for such explainers. Think about the following.
Enslaving the Algorithm: From a ‘Right to an Explanation’ to a ‘Right to Better Decisions’? (Edwards and Veale 2018)
TODO: Sparse model approximation / variable selection / feature ranking
Model-level explainers help to understand how the model works in general, for some population of interest. This is the main difference from instance-level explainers, which focus on model behaviour around a single observation. Model-level explainers work in the context of a population or subpopulation.
Think about the following use-cases.
All the cases mentioned above are linked with either model diagnostics (checking whether the model behaves according to our expectations) or knowledge extraction (the model was trained to extract some knowledge about the discipline).
Model-level explanations are focused on four main aspects of a model.
In the spirit of the three laws introduced in Chapter @ref(three-single-laws), here we propose three laws for model-level explanations.
Methods presented in this chapter are useful for the assessment of feature importance. There are many possible applications of such methods, for example:
There are many methods for the assessment of feature importance. In general, we may divide them into two groups: methods that are model-specific and methods that are model-agnostic.
Some models, like random forests, gradient boosting, linear models and many others, have their own ways to assess feature importance. Such methods are linked with the particular structure of the model. For linear models such specific measures are based on normalized regression coefficients or p-values. For tree-based ensembles such measures may be based on the utilization of particular features in particular trees, see (Foster 2017) for gradient boosting or (Paluszynska and Biecek 2017a) for random forests.
But in this book we focus on methods that are model-agnostic. The main reason is that model-agnostic methods can be used to compare models with different structures.
Model-agnostic methods cannot assume anything about the model structure, and we do not want to refit the model. The method presented below is described in detail in (Fisher, Rudin, and Dominici 2018). The main idea is to measure how much the model fit decreases if a selected feature or group of features is cancelled out. Here cancellation means perturbation, e.g. resampling from the empirical distribution or just permutation of the values of the feature.
The method can be used to measure the importance of single features, pairs of features, or larger tuples. For simplicity, below we describe the algorithm for single features, but it is straightforward to use it for larger subsets of features.
The idea is simple and in some sense borrowed from Random Forests (Breiman et al. 2018). If a feature is important, then the model performance should drop after its permutation. The larger the drop, the more important the feature.
Let's describe this idea in a bit more formal way. Let \(\mathcal L(f(x), y)\) be a loss function that assesses the goodness of fit for a model \(f(x)\), and let \(\mathcal X\) be the set of features. The algorithm proceeds as follows:
1. Calculate the loss for the original data, \(L^0 = \mathcal L(f(X), y)\).
2. For each feature \(i\), create a dataset \(X^{*,-i}\) in which the values of feature \(i\) are permuted.
3. Calculate the loss for the perturbed data, \(L^{*,-i} = \mathcal L(f(X^{*,-i}), y)\).
4. Measure the importance of feature \(i\) as the difference \(L^{*,-i} - L^0\).
5. Alternatively, normalize the importance as the ratio \(L^{*,-i} / L^0\).
Note that the ranking of feature importance will be the same for the difference and the ratio, since the baseline loss \(L^0\) is the same for all features.
Note also that the main advantage of step 5 is that the feature importance becomes normalized. But in many cases such normalization is not needed, and in fact it makes more sense to present the raw \(L^{*,-i}\) values.
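The algorithm above can be sketched in a few lines of R. This is our illustration, not the ingredients implementation; loss and predict_fun are hypothetical arguments.
# Permutation-based importance: the increase in loss after permuting each feature.
permutation_importance <- function(model, data, y, loss, predict_fun) {
  loss_0 <- loss(y, predict_fun(model, data))               # step 1: original loss
  sapply(colnames(data), function(variable) {
    data_perm <- data
    data_perm[[variable]] <- sample(data_perm[[variable]])  # step 2: permute one feature
    loss(y, predict_fun(model, data_perm)) - loss_0         # steps 3-4: drop in performance
  })
}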
Let's apply this approach to a random forest model created for the Titanic dataset. The goal is to predict passenger survival probability based on sex, age, class, fare and some other features available in the titanic dataset.
head(titanic_small)
Permutation-based feature importance can be calculated with the feature_importance{ingredients} function. By default it permutes values feature by feature.
Instead of showing normalized feature importance, we plot both the original loss \(L^0\) and the loss after permutation \(L^{*,-i}\). This way we can also read how good the model was and, as we will see in the next subsection, it will be useful for model comparison.
library("ingredients")
fi_rf <- feature_importance(explainer_rf)
plot(fi_rf) + ggtitle("Permutation based feature importance", "For Random Forest model and Titanic data")
Feature importance. Each interval presents the difference between original model performance (left end) and the performance on a dataset with a single feature perturbed
It is interesting that the most important variable for the Titanic data is Sex. So it was ,,women first'' after all. Then come three features of similar importance: passenger class (first class had higher survival), age (kids had higher survival) and fare (owners of more expensive tickets had higher survival).
Note that drawing permutations involves some randomness. Thus, to obtain higher repeatability of results, you may either set a seed for the random number generator or replicate the procedure a few times. The second approach has an additional advantage: you will learn the uncertainty behind the feature importance assessment.
Here we present scores for 10 repetitions of the process.
fi_rf10 <- replicate(10, feature_importance(explainer_rf), simplify = FALSE)
do.call(plot, fi_rf10) + ggtitle("Permutation based feature importance", "For Random Forest model and Titanic data")
Feature importance for 10 replications of the feature importance assessment
It is much easier to assess feature importance if it comes with some assessment of its uncertainty. We can read from the plot that Age and passenger class are close to each other.
Note that the intervals are also useful for model comparisons. In Figure @ref(titanic5) we can read feature importance for random forest, gradient boosting and logistic regression models. The best results are achieved by the random forest model, and this model also consumes more features than the others. A good example is the Fare variable, not used in gradient boosting nor logistic regression (as a feature highly correlated with passenger class) but consumed in the random forest model.
fi_rf <- feature_importance(explainer_rf)
fi_gbm <- feature_importance(explainer_gbm)
fi_glm <- feature_importance(explainer_glm)
plot(fi_rf, fi_gbm, fi_glm)
Feature importance for random forest, gradient boosting and logistic regression models
Let’s create a regression model for prediction of apartment prices.
library("DALEX")
library("randomForest")
set.seed(59)
model_rf <- randomForest(m2.price ~ construction.year + surface + floor +
no.rooms + district, data = apartments)
A popular loss function for regression models is the root-mean-square loss \[ L(x, y) = \sqrt{\frac1n \sum_{i=1}^n (x_i - y_i)^2} \]
loss_root_mean_square(
predict(model_rf, apartments),
apartments$m2.price
)
## [1] 193.8477
Let's calculate the feature importance.
explainer_rf <- explain(model_rf,
data = apartmentsTest[,2:6], y = apartmentsTest$m2.price)
vip <- variable_importance(explainer_rf,
loss_function = loss_root_mean_square)
vip
On a diagnostic plot it is useful to present feature importance as an interval that starts at the loss on the original data and ends at the loss on the perturbed data.
plot(vip)
Much more can be read from feature importance plots if we compare models of different structures. Let's train three predictive models on the apartments dataset from the DALEX package: a Random Forest model (Breiman et al. 2018) (flexible but possibly biased), a Support Vector Machines model (Meyer et al. 2017) (large variance on boundaries), and a Linear Model (stable but not very flexible). The presented examples are for regression (prediction of the square meter price), but the method may be used in the same way for classification.
Let’s fit these three models.
library("DALEX")
model_lm <- lm(m2.price ~ construction.year + surface + floor +
no.rooms + district, data = apartments)
library("randomForest")
set.seed(59)
model_rf <- randomForest(m2.price ~ construction.year + surface + floor +
no.rooms + district, data = apartments)
library("e1071")
model_svm <- svm(m2.price ~ construction.year + surface + floor +
no.rooms + district, data = apartments)
For these models we use DALEX explainers created with explain() function. These explainers wrap models, predict functions and validation data.
explainer_lm <- explain(model_lm,
data = apartmentsTest[,2:6], y = apartmentsTest$m2.price)
vip_lm <- variable_importance(explainer_lm,
loss_function = loss_root_mean_square)
vip_lm
explainer_rf <- explain(model_rf,
data = apartmentsTest[,2:6], y = apartmentsTest$m2.price)
vip_rf <- variable_importance(explainer_rf,
loss_function = loss_root_mean_square)
vip_rf
explainer_svm <- explain(model_svm,
data = apartmentsTest[,2:6], y = apartmentsTest$m2.price)
vip_svm <- variable_importance(explainer_svm,
loss_function = loss_root_mean_square)
vip_svm
Let’s plot feature importance for all three models on a single plot.
The intervals start at different values; from this we can read that the loss for the SVM model is the lowest.
When we compare the features, it looks like in all models the district is the most important feature, followed by surface and floor.
plot(vip_rf, vip_svm, vip_lm)
There is an interesting difference between the linear model and the others in how important construction.year is. For the linear model this variable is not important, while for the remaining two models it has some importance.
In the next chapter we will see how this is possible.
What does feature importance mean? How is it linked with the data distribution?
Methods presented in this chapter are useful for extracting information about feature effects, i.e. how a feature is linked with the model response. There are many possible applications of such methods, for example:
One of the first and most popular tools for model understanding are Partial Dependence Plots (sometimes named Partial Dependence Profiles) (Friedman 2000).
PDPs were introduced by Friedman (2000) in the paper devoted to Gradient Boosting Machines (GBM), a new type of complex yet effective models. For many years PDPs, like sleeping beauties, stayed in the shadow of boosting. This has changed in recent years.
The general idea is to show how the expected model response behaves as a function of a selected feature. Here the word ,,expected'' means averaged over the population. We can think of PDPs as an average of the Ceteris Paribus profiles introduced in Chapter @ref(ceterisParibus).
Let's see how they are constructed step by step. Here we use a random forest model created for the titanic dataset. The examples are related to a single variable, Age.
As introduced in Chapter @ref(ceterisParibus), Ceteris Paribus profiles are calculated for individual observations. They show how the model response changes if a selected variable in this observation is modified.
\[ CP^{f, j, x}(z) := f(x|^j = z). \]
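Conceptually, a single profile is cheap to compute: we replicate the observation of interest, vary one variable over a grid of values, and query the model. A minimal sketch, assuming x_star is a one-row data frame (the names cp_profile, x_star and predict_fun are ours, not part of any package):
cp_profile <- function(model, x_star, variable, grid, predict_fun) {
  # replicate the observation of interest, one copy per grid point
  x_rep <- x_star[rep(1, length(grid)), ]
  x_rep[[variable]] <- grid
  data.frame(z = grid, response = predict_fun(model, x_rep))
}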
Such profiles can be calculated for example with the ceteris_paribus{ingredients} function.
library("ingredients")
selected_passangers <- select_sample(titanic_small, n = 100)
cp_rf <- ceteris_paribus(explainer_rf, selected_passangers)
So for a single model and a single variable we get a bunch of what-if profiles. In Figure @ref(pdp_part_1) we show an example for 100 observations. Despite some variation (random forests are not as stable as we would hope), we see that most profiles are decreasing. So the older the passenger is, the lower the survival probability.
Ceteris Paribus profiles for 100 observations, the Age variable and the random forest model
In the most common formulation, Partial Dependency Plots are expected values of the CP profiles (see the pdp package (Greenwell 2017a)).
\[ g_i(z) = E_{x^{-i}}[ f(x|^i = z, x^{-i}) ]. \]
Of course, this expectation cannot be calculated directly, as we fully know neither the distribution of \(x^{-i}\) nor \(f()\). Yet this value may be estimated by
\[ \hat g_i(z) = \frac{1}{n} \sum_{j=1}^{n} f(x|^i = z, x_j^{-i}). \]
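This estimator is just the average of Ceteris Paribus profiles over \(n\) observations. A minimal sketch (ours, not the ingredients implementation; predict_fun is a hypothetical argument):
pdp_estimate <- function(model, data, variable, grid, predict_fun) {
  sapply(grid, function(z) {
    data_z <- data
    data_z[[variable]] <- z           # set x^i = z for all observations
    mean(predict_fun(model, data_z))  # average model response
  })
}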
Such average can be calculated with the aggregate_profiles{ingredients} function.
library("ingredients")
selected_passangers <- select_sample(titanic_small, n = 100)
cp_rf <- ceteris_paribus(explainer_rf, selected_passangers)
pdp_rf <- aggregate_profiles(cp_rf, selected_variables = "Age")
So for a single model and a single variable we get a single profile. See an example in Figure @ref(pdp_part_2). It is much easier to follow than 100 separate curves, and in cases in which the Ceteris Paribus profiles are more or less parallel, the Partial Dependency profile is a good summary of them.
The average response is of course more stable (as it is an average) and, in this case, it is more or less a decreasing curve. It is much easier to notice that the older the passenger is, the lower the survival probability, and that the largest drop in survival happens for teenagers. On average, the survival probability for adults is 30 percentage points smaller than for kids.
Partial Dependency profile as an average for 100 observations
As we said in the previous section, Partial Dependency is a good summary if the Ceteris Paribus profiles are similar, i.e. parallel. But it may happen that the variable of interest interacts with some other variable. Then the profiles are not parallel, because the effect of the variable of interest depends on other variables.
On one hand, it would be good to summarize all these Ceteris Paribus profiles with a smaller number of profiles; on the other hand, a single aggregate may not be enough. To deal with this problem, we propose to cluster the Ceteris Paribus profiles and check how homogeneous they are.
The most straightforward approach is to use a clustering method, like the k-means algorithm or hierarchical clustering, and see how the clusters of profiles behave. Once the clusters are established, we can aggregate within clusters in the same way as in the case of Partial Dependency Plots.
Such clusters can be calculated with the cluster_profiles{ingredients} function. We choose hierarchical clustering with Ward linkage, as it gives the most stable results.
library("ingredients")
selected_passangers <- select_sample(titanic_small, n = 100)
cp_rf <- ceteris_paribus(explainer_rf, selected_passangers)
clust_rf <- cluster_profiles(cp_rf, k = 3, selected_variables = "Age")
So for a single model and a single variable we get \(k\) profiles. A common problem in clustering is the selection of \(k\). However, in our case, as it is an exploration, the problem is simpler: we are interested in whether \(k=1\) (Partial Dependency is a good summary) or not (there are some interactions).
See an example in Figure @ref(pdp_part_4). It is easy to notice that the Ceteris Paribus profiles can be grouped into three clusters: a group of passengers with a very large drop in survival (cluster 1), a moderate drop (cluster 2), and almost no drop in survival (cluster 3). Here we do not know which other factors are linked with these clusters, but some additional exploratory analysis can be done to identify them.
Cluster profiles for 3 clusters over 100 Ceteris Paribus profiles
Once we see that the variable of interest may interact with some other variable, it is tempting to look for the factor that distinguishes the clusters.
The most straightforward approach is to use some other variable as a grouping variable. This can be done by setting the groups argument in the aggregate_profiles{ingredients} function.
library("ingredients")
selected_passangers <- select_sample(titanic_small, n = 100)
cp_rf <- ceteris_paribus(explainer_rf, selected_passangers)
pdp_Sex_rf <- aggregate_profiles(cp_rf, selected_variables = "Age",
groups = "Sex")
See an example in Figure @ref(pdp_part_5). Clearly there is an interaction between Age and Sex. The survival for women is more stable, while for men there is a sudden drop in survival for older passengers. Check what the interaction with Pclass (passenger class) looks like.
Grouped profiles with respect to the Sex variable
Contrastive comparisons of Partial Dependency Plots are useful not only for subgroups of observations but also for model comparisons.
Why would one like to compare models? There are at least three reasons for it.
The generic plot{ingredients} function handles multiple models as consecutive arguments.
library("ingredients")
plot(pdp_rf, pdp_glm, pdp_gbm, selected_variables = "Age", color = "_label_")
See an example in Figure @ref(pdp_part_7). The random forest is compared with a gradient boosting model and a generalized linear model (logistic regression). All three models agree when it comes to the general relation between Age and Survival. Logistic regression is of course the smoothest. Gradient boosting has, on average, higher predictions than the random forest.
Comparison of three predictive models with different structures.
One of the largest advantages of Partial Dependency Profiles is that they are easy to explain, as they are just an average across Ceteris Paribus profiles. But one of the largest disadvantages lies in the assumption behind CP profiles, namely that it makes sense to change the variable \(x^i\) independently from all other variables \(x^{-i}\).
This assumption may not make sense at all. Features like \(surface\) and \(number.of.rooms\) are strongly correlated, as apartments with a larger number of rooms usually have a larger surface. It makes no sense to consider an apartment with 10 rooms and 20 square meters, so it may be misleading to change \(x^{surface}\) independently from \(x^{number.of.rooms}\).
There are several attempts to fix this problem. Among the best known are Accumulated Local Effects Plots (ALEPlots) (Apley 2018) and Local Conditional Expectation (LCE) profiles [TODO: reference to Rafal's paper].
(Demšar and Bosnić 2018)
(Greenwell 2017b) (Puri et al. 2017)
(Sitko, Grudziąż, and Biecek 2018)
(Strobl et al. 2007) (Strobl et al. 2008) - variable importance
(Fisher, Rudin, and Dominici 2018)
Beware Default Random Forest Importances
Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard March 26, 2018.
http://explained.ai/rf-importance/index.html
(Sitko, Grudziąż, and Biecek 2018)
library(factorMerger)
(Paluszynska and Biecek 2017b) (Goldstein, Kapelner, and Bleich 2017) (Apley 2018)
(Tatarynowicz, Romaszko, and Urbański 2018)
Goal: how good is the model, which is better
Model selection
library("auditor")
library("DALEX2")
library("ranger")
library("e1071")
rf_model <- ranger(life_length ~ ., data = dragons)
lm_model <- lm(life_length ~ ., data = dragons)
svm_model <- svm(life_length ~ ., data = dragons)
predict_function <- function(m,x,...) predict(m, x, ...)$predictions
rf_au <- audit(rf_model, data = dragons, y = dragons$life_length,
predict.function = predict_function)
lm_au <- audit(lm_model, data = dragons, y = dragons$life_length)
svm_au <- audit(svm_model, data = dragons, y = dragons$life_length)
plotResidualBoxplot(rf_au, lm_au, svm_au)
plotRROC(rf_au, lm_au, svm_au)
Goal: verify if model is ok
(Gosiewska and Biecek 2018)
library("auditor")
library("DALEX2")
library("ranger")
rf_model <- ranger(life_length ~ ., data = dragons)
predict_function <- function(m,x,...) predict(m, x, ...)$predictions
rf_au <- audit(rf_model, data = dragons, y = dragons$life_length,
predict.function = predict_function)
check_residuals(rf_au)
## -----------------------------------------------
## Checks for autocorrelation
## -----------------------------------------------
## Model name: ranger
## Autocorrelation in target: +0.01
## Autocorrelation in residuals: +0.01
## -----------------------------------------------
## Checks for outliers
## -----------------------------------------------
## Model name: ranger
## Shift > 1: 20 ( 1 %)
## Shift > 2: 14 ( 0.7 %)
## Top lowest standardised residuals:
## -2.5534 (1888), -2.4594 (920), -2.4588 (1388), -2.4019 (1508), -2.2377 (1829)
## Top highest standardised residuals:
## 15.242 (1914), 11.256 (1745), 9.7175 (1532), 9.6979 (1111), 8.391 (1051)
## -----------------------------------------------
## Checks for trend in residuals
## -----------------------------------------------
## Model name: ranger
## Standardised loess fit: +76.75 ***
plotResidualBoxplot(rf_au)
plotResidual(rf_au, variable = "Observed response")
plotScaleLocation(rf_au)
plotRROC(rf_au)
plotAutocorrelation(rf_au)
Machine learning models are often fitted and validated on historical data under the silent assumption that the data are stationary. The most popular techniques for validation (k-fold cross-validation, repeated cross-validation, and so on) test models on data with the same distribution as the training data.
Yet, in many practical applications, deployed models work in a changing environment. After some time, due to changes in the environment, model performance may degrade and the model may become less reliable.
Concept drift refers to a change in the data distribution or in the relationships between variables over time. Think about a model of energy consumption for a school: over time, the school may be equipped with a larger number of devices or with more power-efficient devices, which may affect the model performance.
In this chapter we define basic ideas behind concept drift and propose some solutions.
In general, concept drift means that some statistical properties of variables used in the model change over time. This may result in degraded performance. Thus, early detection of concept drift is very important, as it allows adapting quickly to these changes.
The term concept usually refers to the target variable but, more generally, it can also refer to the model input or to relations between variables.
The most general formulation of concept drift refers to changes in the joint distribution \(p(X, y)\). It is also useful to define the following, more specific, notions discussed in the next sections: covariate drift (a change in \(p(X)\)), residual drift (a change in the distribution of model residuals), and model drift (a change in \(p(y|X)\)).
Once the drift is detected one may re-fit the model on newer data or update the model.
Covariate drift is a change in the distribution of the input, i.e. in the distribution \(p(X)\). The input is a \(p\)-dimensional vector with variables of possibly mixed types and distributions.
Here we propose a simple one-dimensional method that can be applied to each variable separately, regardless of its type. We do not rely on any formal statistical test, as the power of a test depends on the sample size, and for large samples a test will detect even small differences.
We consider a use-case with two samples. One sample gathers historical ,,old'' data; this may be the data available during model development (part of it may have been used as training data and part as test data). The second sample is the current ,,new'' data, and we want to know whether the distribution of \(X_{old}\) differs from the distribution of \(X_{new}\).
There are a lot of distances between probability measures that can be used here (for example Wasserstein, Total Variation, and so on). We are using the Non-Intersection Distance due to its easy interpretation.
For categorical variables, the non-intersection distance between distributions \(P\) and \(Q\) is defined as \[ d(P,Q) = 1 - \sum_{i\in \mathcal X} \min(p_i, q_i) \] where \(\mathcal X\) is the set of all possible values, while \(p_i\) and \(q_i\) are the probabilities of these values in the distributions \(P\) and \(Q\), respectively. The intuition behind this distance is that it is the amount of the distribution \(P\) that is not shared with \(Q\) (the distance is symmetric). The smaller the value, the closer the distributions.
For continuous variables, we first discretize their distributions, in the spirit of the \(\chi^2\) test.
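A minimal sketch of this distance for a continuous variable (our illustration, not the drifter implementation): we bin the pooled sample by quantiles and compare the two resulting histograms.
non_intersection_distance <- function(x_old, x_new, bins = 10) {
  # common breaks based on quantiles of the pooled sample
  breaks <- unique(quantile(c(x_old, x_new), probs = seq(0, 1, length.out = bins + 1)))
  p <- table(cut(x_old, breaks, include.lowest = TRUE)) / length(x_old)
  q <- table(cut(x_new, breaks, include.lowest = TRUE)) / length(x_new)
  1 - sum(pmin(p, q))
}
# identical distributions give values close to 0, shifted ones larger values
non_intersection_distance(rnorm(1000), rnorm(1000))
non_intersection_distance(rnorm(1000), rnorm(1000, mean = 1))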
Here we are going to use the drifter package that implements some tools for concept drift detection.
As an illustration we use two datasets from the DALEX2 package, namely apartments (here we do not have drift) and dragons (here we do have drift).
library("DALEX2")
library("drifter")
# here we do not have any drift
head(apartments, 2)
d <- calculate_covariate_drift(apartments, apartments_test)
d
# here we do have drift
head(dragons, 2)
d <- calculate_covariate_drift(dragons, dragons_test)
d
Perhaps the most obvious negative effect of concept drift is that the model performance degrades over time.
But this is also something that is straightforward to verify. One can calculate the distribution of residuals on the new data and compare it with the distribution of residuals obtained on the old data.
Again, we have two samples, residuals calculated on the old dataset
\[ r_{old} = y_{old} - \hat y_{old} = y_{old} - f_{old}(X_{old}) \] versus residuals calculated on the new dataset \[ r_{new} = y_{new} - \hat y_{new} = y_{new} - f_{old}(X_{new}) \]
We can use any distance between distributions to compare \(r_{new}\) and \(r_{old}\), for example the non-intersection distance.
Here we are going to use the drifter package.
library("DALEX2")
library("drifter")
library("ranger")
data_old <- apartments_test[1:4000,]
data_new <- apartments_test[4001:8000,]
predict_function <- function(m,x,...) predict(m, x, ...)$predictions
model_old <- ranger(m2.price ~ ., data = apartments)
calculate_residuals_drift(model_old,
data_old, data_new,
data_old$m2.price,
data_new$m2.price,
predict_function = predict_function)
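For comparison, the same check can be assembled by hand from the pieces defined above, using the non_intersection_distance() helper sketched in the section on covariate drift.
# residuals of the old model on old and new data
r_old <- data_old$m2.price - predict_function(model_old, data_old)
r_new <- data_new$m2.price - predict_function(model_old, data_new)
non_intersection_distance(r_old, r_new)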
Model drift is a change in the relation between the target variable and the input variables, i.e. a change in \(p(y|X)\). The input is a \(p\)-dimensional vector with variables of possibly mixed types and distributions.
Here we propose a simple one-dimensional method based on the Partial Dependency Plots introduced in Chapter @ref(partialDependence). PDP profiles summarize the marginal relation between \(\hat y\) and a variable \(x_i\). The idea is to compare two models, the old model \(f_{old}\) and a model refitted on the new data, \(f_{new}\), through their PDP profiles.
For each variable we can obtain a drift score, calculated as the squared \(L_2\) distance between the PDP profiles of both models:
\[ drift_{i} = \frac 1 {|Z_i|}\int_{z\in Z_i} \left(PDP_i(f_{old}, z) - PDP_i(f_{new}, z)\right)^2 dz \] where \(Z_i\) is the set of values of variable \(x_i\) (for simplicity we assume that it is an interval), while \(PDP_i(f, z)\) is the PDP profile for variable \(i\) calculated for the model \(f\) at point \(z\).
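For a continuous variable, this integral can be approximated on a grid. A minimal sketch (ours, not the drifter implementation) that reuses the pdp_estimate() helper sketched in the chapter on Partial Dependency Plots:
pdp_drift <- function(model_old, model_new, data, variable, predict_fun,
                      grid_size = 101) {
  grid <- seq(min(data[[variable]]), max(data[[variable]]), length.out = grid_size)
  pdp_old <- pdp_estimate(model_old, data, variable, grid, predict_fun)
  pdp_new <- pdp_estimate(model_new, data, variable, grid, predict_fun)
  mean((pdp_old - pdp_new)^2)  # approximates the integral divided by |Z_i|
}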
Here we are going to use the drifter package. Instead of using old and new data, this time we compare a model trained on data for males versus a model trained on data for females. Because of the interaction between gender and age, the models created on these two datasets are different.
library("DALEX2")
library("drifter")
library("ranger")
predict_function <- function(m,x,...) predict(m, x, ..., probability=TRUE)$predictions[,1]
data_old = HR[HR$gender == "male", -1]
data_new = HR[HR$gender == "female", -1]
model_old <- ranger(status ~ ., data = data_old, probability = TRUE)
model_new <- ranger(status ~ ., data = data_new, probability = TRUE)
calculate_model_drift(model_old, model_new,
HR_test,
HR_test$status == "fired",
max_obs = 1000,
predict_function = predict_function)
library("ceterisParibus2")
prof_old <- individual_variable_profile(model_old,
data = data_new,
new_observation = data_new[1:1000,],
label = "model_old",
predict_function = predict_function)
prof_new <- individual_variable_profile(model_new,
data = data_new,
new_observation = data_new[1:1000,],
label = "model_new",
predict_function = predict_function)
plot(prof_old, prof_new,
selected_variables = "age", aggregate_profiles = mean,
show_observations = FALSE, color = "_label_", alpha = 1)
In this chapter we present an artificial dataset from a Human Resources department in a call center.
The dataset is available in the DALEX package (Biecek 2018c). Each row corresponds to a single employee in the call center. Features like gender, age, average number of working hours per week, grade from the last evaluation, and level of salary are used as predictive features.
The goal is to build a model that will guess when to fire and when to promote an employee, so it is a classification problem with three classes.
Why do we need such a model? We want decisions that are objective and not subject to the personal preferences of a manager. But is it possible to have an objective model? Would it be fair, or would it just replicate some unfairness?
We will use this example to show how to use prediction level explainers to better understand how the model works for selected cases.
library("DALEX")
head(HR)
In this book we focus on model exploration rather than model building, thus for the sake of simplicity we will use two default models: one created with random forest (Breiman et al. 2018) and one with a multinomial log-linear model (Ripley 2016).
set.seed(59)
library("randomForest")
model_rf <- randomForest(status ~ gender + age + hours + evaluation + salary, data = HR)
library("nnet")
model_glm <- multinom(status ~ gender + age + hours + evaluation + salary, data = HR)
## # weights: 21 (12 variable)
## initial value 8620.810629
## iter 10 value 7002.127738
## iter 20 value 6239.478146
## iter 20 value 6239.478126
## iter 20 value 6239.478124
## final value 6239.478124
## converged
In this chapter we present an artificial dataset related to the prediction of apartment prices in Warsaw. This dataset will be used to discuss pros and cons of different techniques for model-level explanations.
The dataset is available in the DALEX package (Biecek 2018c). Each row corresponds to a single apartment. Features like surface, number of rooms, district, or floor are used as predictive features.
The problem here is to predict the price of a square meter of an apartment, so it is a regression problem with a continuous outcome.
library("DALEX")
head(apartments)
The goal here is to predict the average price per square meter of an apartment. Let's build a random forest model with the randomForest package (Breiman et al. 2018).
library("randomForest")
model_rf <- randomForest(m2.price ~ construction.year + surface + floor + no.rooms + district, data = apartments)
model_rf
##
## Call:
## randomForest(formula = m2.price ~ construction.year + surface + floor + no.rooms + district, data = apartments)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 82079.37
## % Var explained: 90.01
And a linear model.
model_lm <- lm(m2.price ~ construction.year + surface + floor + no.rooms + district, data = apartments)
model_lm
##
## Call:
## lm(formula = m2.price ~ construction.year + surface + floor +
## no.rooms + district, data = apartments)
##
## Coefficients:
## (Intercept) construction.year surface
## 5020.139 -0.229 -10.238
## floor no.rooms districtBielany
## -99.482 -37.730 17.214
## districtMokotow districtOchota districtPraga
## 918.380 926.254 -37.105
## districtSrodmiescie districtUrsus districtUrsynow
## 2080.611 29.942 -18.865
## districtWola districtZoliborz
## -16.891 889.973
Here we present the list of arguments used by explainers from DrWhy. All explainers use a unified set of arguments. All of them are generic, with two specific implementations: *.explainer and *.default. The first one works for objects created with the DALEX2::explain() function.
Common core of arguments:
- x: a model to be explained, or an explainer created with the DALEX2::explain() function.
- data: validation dataset, used to determine univariate distributions, calculation of quantiles, correlations, and so on. It will be extracted from x if x is an explainer.
- predict_function: a predict function that operates on the model x. Since the model is a black box, the predict_function is the only interface to access values from the model. It should be a function that takes at least a model x and data, and returns a vector of predictions. If the model response is more than a single number (as for multiclass models), then this function should return a matrix/data.frame of size m x d, where m is the number of observations and d is the dimensionality of the model response. It will be extracted from x if x is an explainer.
- new_observation: an observation/observations to be explained. Required for local/instance-level explainers. Columns should correspond to columns in the data argument.
- ...: other parameters.
- label: the name of the model. By default it is extracted from the class attribute of the model.
Function-specific arguments:
- keep_distributions: if TRUE, then distributions of partial predictions are stored and can be plotted with the generic plot().

Apley, Dan. 2018. ALEPlot: Accumulated Local Effects (ALE) Plots and Partial Dependence (PD) Plots. https://CRAN.R-project.org/package=ALEPlot.
Bach, Sebastian, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. “On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation.” Edited by Oscar Deniz Suarez. PLOS ONE 10 (7): e0130140. https://doi.org/10.1371/journal.pone.0130140.
Biecek, Przemyslaw. 2018a. BreakDown: Model Agnostic Explainers for Individual Predictions. https://CRAN.R-project.org/package=breakDown.
———. 2018b. CeterisParibus: Ceteris Paribus Profiles. https://pbiecek.github.io/ceterisParibus/.
———. 2018c. DALEX: Descriptive mAchine Learning Explanations. https://pbiecek.github.io/DALEX/.
Breiman, Leo, Adele Cutler, Andy Liaw, and Matthew Wiener. 2018. RandomForest: Breiman and Cutler’s Random Forests for Classification and Regression. https://CRAN.R-project.org/package=randomForest.
Cortez, Paulo, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4): 547–53. https://doi.org/10.1016/j.dss.2009.05.016.
Demšar, Jaka, and Zoran Bosnić. 2018. “Detecting Concept Drift in Data Streams Using Model Explanation.” Expert Systems with Applications 92 (February): 546–59. https://doi.org/10.1016/j.eswa.2017.10.003.
Edwards, Lilian, and Michael Veale. 2018. “Enslaving the Algorithm: From a ‘Right to an Explanation’ to a ‘Right to Better Decisions’?” IEEE Security and Privacy 16 (3): 46–54. https://doi.org/10.1109/MSP.2018.2701152.
Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. 2018. “Model Class Reliance: Variable Importance Measures for Any Machine Learning Model Class, from the ’Rashomon’ Perspective.” Journal of Computational and Graphical Statistics. http://arxiv.org/abs/1801.01489.
Foster, David. 2017. XgboostExplainer: An R Package That Makes Xgboost Models Fully Interpretable. https://github.com/AppliedDataSciencePartners/xgboostExplainer/.
———. 2018. XgboostExplainer: XGBoost Model Explainer.
Friedman, Jerome H. 2000. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics 29: 1189–1232.
Goldstein, Alex, Adam Kapelner, and Justin Bleich. 2017. ICEbox: Individual Conditional Expectation Plot Toolbox. https://CRAN.R-project.org/package=ICEbox.
Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. “Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.” Journal of Computational and Graphical Statistics 24 (1): 44–65. https://doi.org/10.1080/10618600.2014.907095.
Gosiewska, Alicja, and Przemyslaw Biecek. 2018. Auditor: Model Audit - Verification, Validation, and Error Analysis. https://CRAN.R-project.org/package=auditor.
Greenwell, Brandon M. 2017a. “Pdp: An R Package for Constructing Partial Dependence Plots.” The R Journal 9 (1): 421–36. https://journal.r-project.org/archive/2017/RJ-2017-016/index.html.
———. 2017b. “pdp: An R Package for Constructing Partial Dependence Plots.” The R Journal 9 (1): 421–36. https://journal.r-project.org/archive/2017/RJ-2017-016/index.html.
Harrell Jr, Frank E. 2018. Rms: Regression Modeling Strategies. https://CRAN.R-project.org/package=rms.
Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22. https://CRAN.R-project.org/doc/Rnews/.
Lundberg, Scott M., Gabriel G. Erion, and Su-In Lee. 2018. “Consistent Individualized Feature Attribution for Tree Ensembles.” CoRR abs/1802.03888. http://arxiv.org/abs/1802.03888.
Lundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4765–74. Curran Associates, Inc. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2017. E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), Tu Wien. https://CRAN.R-project.org/package=e1071.
Molnar, Christoph. 2018a. Iml: Interpretable Machine Learning. https://CRAN.R-project.org/package=iml.
———. 2018b. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/.
Molnar, Christoph, Bernd Bischl, and Giuseppe Casalicchio. 2018. “Iml: An R Package for Interpretable Machine Learning.” JOSS 3 (26). Journal of Open Source Software: 786. https://doi.org/10.21105/joss.00786.
O’Connell, Mark, Catherine Hurley, and Katarina Domijan. 2017. “Conditional Visualization for Statistical Models: An Introduction to the Condvis Package in R.” Journal of Statistical Software, Articles 81 (5): 1–20. https://doi.org/10.18637/jss.v081.i05.
O’Neil, Cathy. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New York, NY, USA: Crown Publishing Group.
Paluszynska, Aleksandra, and Przemyslaw Biecek. 2017a. RandomForestExplainer: A Set of Tools to Understand What Is Happening Inside a Random Forest. https://github.com/MI2DataLab/randomForestExplainer.
———. 2017b. RandomForestExplainer: Explaining and Visualizing Random Forests in Terms of Variable Importance. https://CRAN.R-project.org/package=randomForestExplainer.
Pedersen, Thomas Lin, and Michaël Benesty. 2018. Lime: Local Interpretable Model-Agnostic Explanations. https://CRAN.R-project.org/package=lime.
Puri, Nikaash, Piyush Gupta, Pratiksha Agarwal, Sukriti Verma, and Balaji Krishnamurthy. 2017. “MAGIX: Model Agnostic Globally Interpretable Explanations.” CoRR abs/1706.07160. http://arxiv.org/abs/1706.07160.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “Why Should I Trust You?: Explaining the Predictions of Any Classifier.” In, 1135–44. ACM Press. https://doi.org/10.1145/2939672.2939778.
Ridgeway, Greg. 2017. Gbm: Generalized Boosted Regression Models. https://CRAN.R-project.org/package=gbm.
Ripley, Brian. 2016. Nnet: Feed-Forward Neural Networks and Multinomial Log-Linear Models. https://CRAN.R-project.org/package=nnet.
Shapley, Lloyd S. 1953. “A Value for N-Person Games.” In Contributions to the Theory of Games Ii, edited by Harold W. Kuhn and Albert W. Tucker, 307–17. Princeton: Princeton University Press.
Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. 2013. “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.” CoRR abs/1312.6034. http://arxiv.org/abs/1312.6034.
Sitko, Agnieszka, Aleksandra Grudziąż, and Przemyslaw Biecek. 2018. FactorMerger: The Merging Path Plot. https://CRAN.R-project.org/package=factorMerger.
Staniak, Mateusz, and Przemysław Biecek. 2018. Live: Local Interpretable (Model-Agnostic) Visual Explanations. https://CRAN.R-project.org/package=live.
Strobl, Carolin, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. 2008. “Conditional Variable Importance for Random Forests.” BMC Bioinformatics 9 (1): 307. https://doi.org/10.1186/1471-2105-9-307.
Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. 2007. “Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution.” BMC Bioinformatics 8 (1): 25. https://doi.org/10.1186/1471-2105-8-25.
Štrumbelj, Erik, and Igor Kononenko. 2014. “Explaining Prediction Models and Individual Predictions with Feature Contributions.” Knowledge and Information Systems 41 (3): 647–65. https://doi.org/10.1007/s10115-013-0679-x.
Tatarynowicz, Magda, Kamil Romaszko, and Mateusz Urbański. 2018. ModelDown: Make Static Html Website for Predictive Models. https://github.com/MI2DataLab/modelDown.
Xie, Yihui. 2018. Bookdown: Authoring Books and Technical Documents with R Markdown. https://CRAN.R-project.org/package=bookdown.